Abstract



Trypanosoma cruzi and Giardia intestinalis are two human pathogens and 
protozoan parasites responsible for the diseases Chagas disease and giardia- 
sis, respectively. Both diseases cause suffering and illness in several million 
individuals. The former disease occurs primarily in South America and Cen- 
tral America, and the latter disease occurs worldwide. Current therapeutics 
are toxic and lack efficacy, and potential vaccines are far from the market. In- 
creased knowledge about the biology of these parasites is essential for drug and 
vaccine development, and new diagnostic tests. In this thesis, high-throughput 
sequencing was applied together with extensive bioinformatic analyses to yield 
insights into the biology and evolution of Trypanosoma cruzi and Giardia in- 
testinalis. Bioinformatics analysis of DNA and RNA sequences was performed 
to identify features that may be of importance for parasite biology and
functional characterization. This thesis is based on five papers (i-v). Papers i
and ii describe comparative genome studies of three distinct genotypes of Giardia
intestinalis (A, B and E). The genome-wide divergence between A and B was 23%
and 13% between A and E. 4557 groups of three-way orthologs were defined 
across the three genomes. 5 to 38 genotype-specific genes were identified, along 
with genomic rearrangements. Genes encoding surface antigens, vsps, had un- 
dergone extensive diversification in the three genotypes. Several bacterial gene 
transfers were identified, one of which encoded an acetyltransferase protein in 
the E genotype. Paper iii describes a genome comparison of the human-infecting
Trypanosoma cruzi with the bat-restricted subspecies Trypanosoma cruzi
marinkellei. The human-infecting parasite had an 11% larger genome, and was
found to have expanded repertoires of sequences related to surface antigens. 
The two parasites had a shared 'core' gene complement. One recent horizontal 
gene transfer was identified in T. c. marinkellei, representing a eukaryote- 
to-eukaryote transfer from a photosynthesizing organism. Paper iv describes 
the repertoire of small non-coding RNAs in Trypanosoma cruzi epimastigotes. 
Sequenced small RNAs were in the size range 16 to 61 nucleotides, and the 
majority were derived from transfer RNAs and other non-coding RNAs. 92 
novel transcribed loci were identified in the genome, 79 of which were without 
similarity to known RNA classes. One population of small RNAs was derived
from protein-coding genes. Paper v describes transcriptome analysis using 
paired-end RNA-Seq of three distinct genotypes of Giardia intestinalis (A, B 
and E). Gene expression profiles recapitulated the known phylogeny of the ex- 
amined genotypes, and 61 to 176 genes were differentially expressed. 49,027 
distinct polyadenylation sites were mapped and compared, and the median 
3'UTR length was 80 nucleotides (A). One 36-nt novel intron was identified 
and the five previously reported introns were confirmed.

1 Synopsis

Nothing in biology makes sense except in the light of evolution 
- T.G. Dobzhansky (1973), geneticist and evolutionary biologist. 



Infectious diseases are leading causes of human suffering and death
around the world, and have a significant impact on the daily lives of many
millions of people. Parasites cause some of the worst and most neglected
diseases, including malaria, African and American trypanosomiasis,
schistosomiasis and several others. These diseases are most prevalent in
tropical regions of the world,
and are often associated with poverty as well as being intrinsically "poverty 
promoting." Unsafe drinking water, compromised hygiene and sanitary con- 
ditions or substandard housing are factors that facilitate disease. Many of the 
afflicted individuals have very limited access to health care. Several factors
contribute to the lack of treatment options: neglected diseases attract
little attention from pharmaceutical companies and first-world governments, 
often because companies are unable to regain investments in expensive basic 
research, drug development and clinical trials. Moreover, many parasites are 
difficult to study in the laboratory due to complex life cycles or because the 
parasites do not readily grow in vitro. Most neglected diseases do not cause
acute outbreaks; instead they progress over many years, causing debilitating
illness and suffering.

Beyond their relevance to human health, parasites often have unique or
specialized biological features, which make them excellent models
for the study of eukaryotic evolution. Parasites provide a window into the
biological and social evolution of our own species, since many parasites have 
co-evolved with the Homo lineage for many millions of years; this
appears to be the situation for the worms Trichuris trichiura and Enterobius
[1] [2]; other parasites such as Trypanosoma cruzi tell a shorter story of human
co-evolution. Parasite-host co-evolution has likely resulted in reciprocal adap- 
tations with complex evolutionary consequences, for example the favouring of 
specific genetic processes such as recombination that operate to create new 
genotypes to which the host is not adapted. Another example of co-evolution 
can be found in our own species, where a trypanolytic factor encoded by our 
genome is likely an ancestral adaptation against trypanosomatid parasites of 
the T. brucei clade [3].

Second generation sequencing enables cost-efficient and rapid acquisition 
of large data sets covering diverse biological aspects, including but not limited 
to genome, metagenome, epigenome and transcriptome studies. Human parasites
are an important target of the new technologies, with the aim of deepening our
understanding of the underlying biology of these pathogens and their evolutionary
trajectories. Such efforts may reveal signatures relating to how parasitism
evolved. Second generation sequencing has already facilitated key insights 
into the molecular organization of these organisms, and rapidly enables ad- 
vancement of functional studies, identification of drug targets and formation 
of new hypotheses. Old questions relating to pathogenicity, epidemiology and 
genetics can be addressed with the new tools and may ultimately lead to 
insights that pave the way for better treatment strategies for neglected and 
tropical diseases. 

Fossil bones and footsteps and ruined homes are the solid facts
of history, but the surest hints, the most enduring signs, lie in 
those miniscule genes. For a moment we protect them with 
our lives, then like relay runners with a baton, we pass them 
on to be carried by our descendents. There is a poetry in 
genetics which is more difficult to discern in broken bones, and 
genes are the only unbroken living thread that weaves back and 
forth through all those boneyards. 

- J. Kingdon, biologist and science author (1996). 

2.1 Current State of Genome Sequencing 

Genomes vary enormously in size, with many of the size differences being 
caused by repeated sequences (Figure 1). Deoxyribonucleic acid (DNA)
sequencing techniques are limited to reading short pieces of DNA. The most
common strategy to overcome this limitation is shotgun sequencing. The 
technique involves random fragmentation of chromosomes to a redundant mix 
of small DNA fragments. These fragments can subsequently be sequenced, 
resulting in a mix of sequences of forward and reverse directions, represent- 
ing the original chromosomes. By using the overlap of these short sequences, 
computer programs can reconstruct millions of short sequences into longer 
sequences (contigs) - a process referred to as genome assembly. While simple 
in theory, the task becomes less straightforward due to the following obsta- 
cles: (i) the large size of most genomes, especially those of eukaryotes; (ii) 
the vast amount of sequence data needed to achieve sufficient redundancy, i.e. 
the genome "coverage"; (iii) the fact that many DNA fragments are identical
or close to identical (repeats); (iv) heterozygosity, i.e. single nucleotide
polymorphisms between homologous chromosomes; and (v) sequence errors. 
Apart from these issues, genomes can exhibit aneuploidy and complex kary- 
otypes - all of which make the assembly task more difficult. Sanger sequencing 
is referred to as 'first generation sequencing,' and has been used to sequence 
large and complex genomes, e.g. the human genome. Sanger sequencing can 
read up to 1000 base pairs (bp) of a DNA fragment. Despite being relatively
old, it is still the most common technique for low-throughput applications,
e.g. sequencing DNA amplified by polymerase chain reaction (PCR). A comparison
of the repeat content in relation to assembly consistency of various draft or
complete genomes is shown in Figure 2.
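
The overlap-based reconstruction described above can be illustrated with a minimal sketch. The Python code below is a toy greedy assembler, not the algorithm of any particular assembly program: it repeatedly merges the two sequences that share the longest suffix-prefix overlap. The function names, the example reads and the minimum overlap of 3 bases are assumptions chosen for illustration, and the sketch ignores reverse-complement reads, sequence errors, heterozygosity and repeats, which are precisely the obstacles listed above.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads into contigs by always joining the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the sequence ATGGCGTACAGGT
reads = ["ATGGCGT", "GCGTACA", "ACAGGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACAGGT']
```

A repeat longer than the read length produces several equally good overlaps, so a greedy merge of this kind can join the wrong fragments; this is one reason why repeats fragment real assemblies.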

Figure 1: Genome sizes on a logarithmic scale. Phage λ has a genome of 48,502
bp, which was determined by Sanger et al. in 1982 [4]. Most genomes of sequenced
parasitic protozoa are between 4 and 100 Mb in size. The genome size of T. cruzi
refers to both haplotypes of the CL Brener strain. For the other eukaryotes the 
genome size refers to the haploid state. (Green dots) Single cell protozoa. Genomes 
of mammals are at least an order of magnitude larger. Marbled lungfish has the 
largest genome of any known organism (130,000 Mb) [5]. 



Second generation sequencing (SGS) offers higher throughput, at lower 
cost per base, but yields shorter sequences (reads). Short reads are often a 
problem for determining a genome sequence, as most genomes contain repet- 
itive sequences longer than the read length [6]. Paired-end protocols have 
been developed to tackle repeats, and allow sequencing of longer DNA frag- 
ments from both ends in order to bridge repeats. Paired-end reads of various 
sizes can subsequently be used to link contigs into scaffolds. Hence, 'scaf- 
folds' is the genomics term for contigs that are ordered and oriented. Several 
SGS techniques or platforms are available, for example Roche/454 sequenc- 
ing, developed from pyrosequencing. The platform from Illumina provides 
significantly higher throughput, albeit at shorter read lengths. Most genome 
sequencing efforts combine data from different platforms to overcome their re- 
spective limitations. Repetitive sequences are currently the major bottleneck 
in genome sequencing projects, especially since most eukaryotes contain various
classes of repeats, e.g. retroelements and segmental duplications. Future
developments in sequencing technology are anticipated to improve genome
sequences, including Pacific Biosciences and, perhaps more distant, nanopore
sequencing [7].
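
The use of read pairs to link contigs into scaffolds can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: each read pair is reduced to the names of the two contigs to which its mates map (given in left-to-right order), and contig orientation and insert-size constraints, which real scaffolders must handle, are ignored. The function names and the threshold `min_links` are hypothetical.

```python
from collections import Counter


def link_counts(pairs):
    """Count read pairs whose two mates map to different contigs."""
    links = Counter()
    for left_contig, right_contig in pairs:
        if left_contig != right_contig:  # pairs within one contig carry no link
            links[(left_contig, right_contig)] += 1
    return links


def build_scaffold(start, links, min_links=2):
    """Order contigs by following the best-supported link from each contig."""
    scaffold = [start]
    seen = {start}
    current = start
    while True:
        candidates = [(n, right) for (left, right), n in links.items()
                      if left == current and right not in seen and n >= min_links]
        if not candidates:
            return scaffold
        _, current = max(candidates)  # link supported by the most read pairs
        scaffold.append(current)
        seen.add(current)


# Hypothetical mapping result: (contig of first mate, contig of second mate)
pairs = ([("c1", "c2")] * 3 + [("c2", "c3")] * 2
         + [("c1", "c3")] + [("c2", "c2")] * 5)
links = link_counts(pairs)
print(build_scaffold("c1", links))  # ['c1', 'c2', 'c3']
```

Requiring at least two supporting pairs per link is one simple guard against chimeric or mis-mapped pairs, which in this example excludes the single c1-c3 link.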

Figure 2: Comparison of genome repeat content and assembly fragmentation of
various eukaryotic pathogens from EuPathDB [8]. Repeat libraries were established
for each genome using RepeatScout [9] (repeats >500 bp in length), and contigs >200
bp of each genome were searched using RepeatMasker [10]. (X-axis) Percentage of
the genome (sum of contig lengths) present in repeats. (Y-axis) Contig count of the 
assembly (logarithmic scale). Strains: B. bovis (T2Bo); C. fasciculata (Cf-Cl); C. 
hominis (TU502); E. intestinalis (ATCC 50506); E. histolytica (HM-1:IMSS); G. 
intestinalis (WB); L. braziliensis (M2903); P. vivax (Sal-1); T. gondii (GT1); T.
vaginalis (G3); T. brucei (427); T. congolense (IL3000); T. cruzi (CL Brener); and 
T. vivax (Y486). 



2.2 RNA-Seq - A Method to Read the Transcriptome 
at Single Nucleotide Resolution 

The field of transcriptomics aims to describe all transcripts in a cell or tis- 
sue and to determine the features of these transcripts: for example 5' and 
3' end structures, splicing patterns and quantitative information about tran- 
script levels. Transcriptome sequencing (RNA-Seq) relies on deep sequencing 
of fragmented cDNA libraries, and derives its quantitative properties from the
number of sequence reads obtained from a particular transcript [11], i.e.
abundant transcripts yield more reads whereas rarer transcripts yield fewer.
Therefore, the final digital expression values are based on simple counting
statistics, which are often normalized by gene length and the number of
sequences generated by the instrument. While microarray data often require
complicated normalization procedures, processing of RNA-Seq data is relatively
simple and straightforward. RNA-Seq results in lower background noise than
microarrays [12]. Any high-throughput sequencing platform can be utilized, but Illumina
has been the most widely adopted and is currently the best supported in terms 
of bioinformatics software. RNA-Seq enables ab initio discovery of new and 
rare transcripts and splicing patterns, which cannot be observed on standard 
microarrays. Several software programs are freely available to process RNA- 
Seq data. These programs rely on optimized algorithms to map (align) large 
quantities of sequence data, and at the same time consider polymorphisms 
and sequence errors. RNA-Seq has rapidly become widely adopted and applied
to various research questions, including differential expression analysis
[13], small ribonucleic acid (RNA) discovery, allele-specific expression [14],
as well as mapping 5' and 3' ends of genes [15].
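
The counting-based normalization described above can be written out as a short calculation. The sketch below assumes the commonly used RPKM measure (reads per kilobase of transcript per million mapped reads), which is only one of several possible normalizations; the function name, gene names and numbers are illustrative and not taken from any data set in this thesis.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: two genes with equal read counts but different lengths.
total_reads = 10_000_000
genes = {"geneA": (500, 2000), "geneB": (500, 500)}  # name -> (reads, length in bp)
expression = {name: rpkm(reads, length, total_reads)
              for name, (reads, length) in genes.items()}
print(expression)  # {'geneA': 25.0, 'geneB': 100.0}
```

Dividing by gene length corrects for the fact that longer transcripts yield more fragments, and dividing by library size makes samples sequenced to different depths comparable.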

Recent developments allow the generation of paired-end libraries, read 
lengths up to 150 nucleotides and strand specificity. These improvements al- 
low going beyond the mRNA component of the transcriptome and sampling 
hidden transcriptional layers. Drawbacks of the RNA-Seq method include: 
(i) the cost of library preparation and sequencing; (ii) lack of user-friendly 
analysis pipelines and interfaces; (iii) RNA or cDNA must be fragmented
into smaller pieces, usually between 100 and 500 nt; (iv) library preparation
and fragment amplification may introduce artifacts or biases; (v) transcript
coverage bias is common, i.e. coverage fluctuations along the 5' to 3' axis of
the mRNA; (vi) long-term storage of RNA-Seq data sets is becoming increasingly
difficult because of large data volumes; and (vii) certain downstream
analysis tasks, e.g. discovery of rare transcript isoforms and splice variants, 
still suffer from many false positives due to artifactual chimeras or amplifi- 
cation biases from library preparation or sequencing. Future developments 
in single molecule sequencing may ameliorate such problems. Despite these 
limitations, RNA-Seq is likely to improve and provide novel insight into the 
transcriptomes of protozoans and other eukaryotes. 

2.3 Giardia intestinalis - A Gastrointestinal Parasite of
Humans and Animals 

Giardia intestinalis is a protozoan, amitochondrial parasite and member of the 
diplomonad group of species, which includes other anaerobic or microaerophilic 
protozoans. The diplomonad group is part of the supergroup Excavata [16]. In 
the literature, the species names G. intestinalis, G. duodenalis and G. lamblia 
are used interchangeably and refer to the same organism. The parasite was
discovered as early as 1681 by the Dutch microscopist van Leeuwenhoek [17]
and described in more detail in 1859 by the Czech physician Lambl [18]. G.
intestinalis infects humans and animals and is one of the most prevalent
gastrointestinal parasites worldwide [19]. G. intestinalis is a potential zoonotic
pathogen since it can infect a broad range of mammals in addition to humans.
In humans, the parasite colonizes the upper part of the small intestine and
adheres to the mucosa along the sides of villi. The infection causes diarrhea, and
may lead to malnutrition and failure to thrive in children [20]. The disease
is particularly a burden in developing countries, where compromised hygiene
may increase transmission and cause endemic outbreaks. Local outbreaks do
occasionally occur in developed countries, for example via the public water
supply [21] or in day care centers [22]. In 2004 a large outbreak of G.
intestinalis occurred in Bergen, Norway, with altogether 1,300 laboratory-confirmed
cases [23]. Since the outbreak, certain individuals have had prolonged and
recurring symptoms of giardiasis, with a profound impact on their quality of
life [24]. Recent data indicate a putative relationship between irritable bowel
syndrome and previous G. intestinalis infection [25].

Since 2004, G. intestinalis has been included in the WHO Neglected Disease
Initiative [26].



2.3.1 Cell Biology and Life Cycle: Regression and Simplicity 

In contrast to other protozoan parasites, G. intestinalis has a relatively sim- 
ple life cycle, consisting of the dormant cyst stage and the replicative tropho- 
zoite stage. Trophozoites have a characteristic half pear-shaped morphology, 
and are 12-15 µm long, 5-9 µm wide and 1-2 µm thick (Figure 3). Trophozoites
have four pairs of flagella, which are anchored to the cytoskeleton. The
parasite rotates around its longitudinal axis to create a forward propulsion
force. The rotation causes the parasite to move at a speed of 12-40 µm/s
[27]. An adhesion disk is present on the ventral surface of the parasite, and
is used for anchoring to substrates. Hernandez-Sanchez et al. reported that
adhesion-deficient G. intestinalis had reduced capacity to establish infection
in Mongolian gerbils [28]. When the parasite anchors to substrates, whether
artificial or natural, the pattern of motion changes to more stable planar
swimming [27]. Unlike most eukaryotes, trophozoite cells contain
two transcriptionally active nuclei [20]. Each nucleus contains a diploid to
tetraploid set of the genome [29]. The biological significance of the polyploid
genome is not clear, but it is a shared feature among many diplomonads
(order Diplomonadida) and likely relates to the evolutionary history of the
order. G. intestinalis has a well-defined endoplasmic reticulum, which can
form excretory vesicles [30]. G. intestinalis lacks a canonical Golgi apparatus
and mitochondria. A vestigial organelle called the mitosome is present.
Mitosomes are double-membraned structures that appear to be involved in iron
metabolism [31].





Figure 3: G. intestinalis trophozoites seen through a microscope. (Green) Tagged 
median body protein. (Blue) DAPI-stained nuclei. Image credit: J.
Jerlström-Hultqvist.

Cysts are the non-motile and metabolically dormant stage of the life cy- 
cle, and represent the infectious agents of giardiasis. Cysts can persist in the 
environment for prolonged periods and remain infectious, being encapsulated 
in a thick cyst wall of carbohydrate and protein. The most common route of 
transmission is the fecal-oral route, via contaminated food or water. Infection 
can also occur via person-to-person contact due to poor hygiene. The infectious
dose can be as low as 10 cysts, as shown in a Texas prison "volunteer"
population in 1954 [32]. Ingested cysts undergo excystation inside the host,
triggered by stomach acids. Cysts then rupture in the small intestine.
Giardiasis is characterized by watery diarrhea, gastric pain and weight loss, and
often but not always resolves spontaneously. The pathophysiology of giardiasis
is poorly understood, but likely involves dysfunction of the epithelial cell
barrier of the intestine and disturbances of the electrolyte balance [33]. The
infection triggers apoptosis of host epithelial cells, and causes shortening of
the brush border villi, all of which may contribute to diarrhea [20]. There
is also evidence that proteolytic enzymes released by G. intestinalis are
involved in the disease [34]. Encystation is the process where trophozoites are
transformed back to cysts, and it is triggered by the intestinal environment
(high levels of bile, low cholesterol and/or shift in pH) [20].



2.3.2 Is G. intestinalis a Primitive Eukaryote or Highly Adapted 
Towards Parasitism? 

Phylogenies based on nucleotide and protein sequences have consistently
identified G. intestinalis as a basal eukaryote [35] [36] [37] [38]. This view has been
corroborated by the apparent lack of some intracellular compartments (e.g.
mitochondria, Golgi and peroxisomes) and an overall simplified, bacterial-like
metabolism [19].



Roger et al. reported the finding of the mitochondrial-like gene cpn60 in
the genome of G. intestinalis [39]. The same year Hashimoto et al. reported
the finding of a nuclear-encoded valyl-tRNA synthetase gene [40], which is
regarded as being of mitochondrial origin in eukaryotes. A mitochondria-derived
organelle, the mitosome, was later discovered [31]. Together these data
suggest that G. intestinalis diverged after the endosymbiosis of the mitochondrial
ancestor, but subsequently lost this feature, possibly as an adaptation to the
microaerophilic life in the intestine. The finding of nucleoli also points in the
direction of a typical eukaryote [41]. The early-branching position of G.
intestinalis in phylogenetic trees has been questioned as an artifact caused by
long-branch attraction (LBA) [42]. The problem of LBA arises when comparing
taxa with variable evolutionary rates, which may lead to the artifactual
early emergence of the fast-evolving taxa [43]. The effect of LBA may be mitigated by
inclusion of additional species. Analysis of small nucleolar RNAs from Archaea
and various unicellular eukaryotes has suggested that G. intestinalis
emerged later than Trypanosoma and Euglena [44]. Altogether the current
data on this parasite indicate that it has undergone reductive evolution and is
highly adapted towards parasitism.

2.3.3 Intraspecific Taxonomy: Two Genotypes Infect Humans

G. intestinalis propagates via binary fission, i.e. an asexual process.
Whether G. intestinalis participates in rare sexual events has been a subject
of debate, but there is currently no direct evidence for a sexual or parasexual
cycle. Tibayrenc et al. showed that the parasite meets the criteria for
a predominantly clonal population structure [45]. Nevertheless, recent data
have suggested the possibility of infrequent genetic exchange [46] [47]. Many
human-infective protozoans have sexual cycles or infrequently participate in
genetic exchange, including Trypanosoma cruzi [48], Toxoplasma gondii [49]
and Plasmodium falciparum [50]. Predominant clonal propagation does not
preclude the existence of rare sexual events, but several observations are
unresolved or inconsistent with a conventional sexual organism.

G. intestinalis has been suggested to comprise a species complex, consisting
of eight distinct but morphologically indistinguishable genotypes
(assemblages; Figure 4). Of the eight recognized assemblages (A to H), only two (A
and B) infect humans as well as various non-human primates, cattle and many
other animals [51]. Population studies have revealed further substructure of
assemblage A, which can be subgrouped into AI, AII and AIII. In contrast,
assemblage B has no clear subgrouping [51]. Variation in
pathogenicity among strains has been documented [52] [53], indicating a putative
relationship between genotype and symptomatology. However, attempts
to associate genotype with disease outcome have often been conflicting, and
there is currently no certain relationship. Only assemblage B has been used
in experimental human infections [54].

Assemblages C to H are not associated with human infections, display 
stronger host-specificity and are less studied due to the difficulty of
cultivating them in vitro. Parasites from assemblages C and D have been identified in
dogs, wolves, coyotes and cats; E parasites have been found in cattle, sheep,
pigs, goats and water buffalo; F parasites are reported mainly in cats; G
parasites are found mainly in rodents [51]. H parasites were relatively recently
discovered in marine vertebrates [55]. The phylogenetic topology of the assemblages
suggests that the extant A, E and F lineages share a common ancestor. It is
possible that animal domestication provided opportunities for parasites to 
cross species boundaries and thereby adapt to new hosts. 

Figure 4: Maximum likelihood phylogeny of G. intestinalis assemblages A to
G based on the ef1α gene (nucleotide sequences). Sequences were aligned with
ClustalW2 and the topology was inferred using the Tamura-Nei model as
implemented in MEGA5 [56]. The scale bar refers to substitutions per site. Support
values at branches were generated from 1000 bootstrap replicates. G. ardeae, a
species found in birds [57], was used as an outgroup to support the phylogeny.
Accession numbers of sequences used to infer the phylogeny: D14342.1, AF069573.1,
AF069570.1, AF069574.1, AF069575.1, AF069571.1, AF069572.1, AF069568.1,
AF069567.1.



Because of the different host species as well as genetic characteristics, 
reorganization of the assemblages into separate species has been proposed 
and is under debate [58] . The new species names for A and B are suggested to 
be G. duodenalis and G. enterica, respectively. Further studies and gathering 
of phenotypic data may shed light on the biology of non-human associated 
Giardia parasites. 



2.3.4 The Streamlined Genome of G. intestinalis Reveals Many 
Parasite-specific Genes 

One striking feature of G. intestinalis is the highly reduced genome, which
comprises ~12 million base pairs (haploid size) distributed on five
chromosomes [19]. Upcroft et al. reported a certain amount of karyotype variability
in human and animal stocks [59], suggesting that the karyotype is not
completely stable. Each nucleus of trophozoites contains a diploid to tetraploid
set of the genome [60] [61] [29]. The genome of the assemblage A isolate WB
was finished in 2007 [62] and revealed sparse non-coding sequences, genes that
are with few exceptions intronless, and simplified cellular components,
including many bacterial- and archaeal-like enzymes. DNA synthesis, transcription,
RNA processing and cell cycle components were found to be simple [62]. Since
the completion of the genome sequence, five genes containing introns have
been found [62] [63] [64] [65]. The discovery of splicing in G. intestinalis was
surprising and suggested that splicing was present at an early stage in the
ancestral eukaryote. Recently, three trans-splicing events have been uncovered,
i.e. the joining of physically distant exons into contiguous mature mRNAs [66]
[65]. The documented events involved exons of the dhc and hsp90 genes. The



finding of such relatively complicated transcript maturation pathways further 
contradicts the view of G. intestinalis as a "fossil" eukaryote. The genome 
does encode proteins involved in meiosis [62] , but these may have alternative 
functions. 

Despite extensive efforts to annotate the G. intestinalis genome, ~58% of
the genes are without a known function. These genes lack sequence similarity
to other sequenced genomes and may represent Giardia-specific genes. The G.
intestinalis genome encodes four multigene families: nek (encoding kinases),
p21.1 (encoding structural proteins, containing ankyrin motifs), vsp (surface
antigens) and hcmp (putative surface antigens). Altogether these genes
comprise 30% (3.6 Mbp/12 Mbp) of the genome [62], indicating that gene
duplication has been a major evolutionary force. Most genes of these families display
extensive heterogeneity, indicating significant divergence since the presumed
gene duplication events. 198 nek genes are present in the genome, most of
which are predicted to encode catalytically inert proteins [62]. Manning et
al. found some nek proteins localized to distinct parts of the cytoskeleton
and cytoplasm [67]. However, the precise roles of most of these proteins are
unknown. Variant-specific surface antigens encoded by vsp genes cover the
surface of the parasite and shield it from the host immune system [19].
Sequence analysis of vsp genes has suggested substructure, recombination and
divergence among these genes [68]. Only one vsp is expressed at the cell
surface at any given time [69], and switching occurs every 6-13 generations [70].
vsp switching occurs spontaneously, and is proposed to involve an epigenetic
mechanism [71] and/or RNA interference [72].



The G. intestinalis genome contains three families of retrotransposons, of
which two are potentially active and localized to telomeres and one is inactive
("dead") and present in interstitial genomic regions [73]. All three families of
retrotransposons are long interspersed nuclear element (LINE)-like elements. Ullu
et al. reported a population of small RNAs derived from the retrotransposon
family GilT/Genie1 located in telomeres, and hypothesized that these may
have a role in transposon silencing [74].

2.3.5 Unexpectedly Low Heterozygosity in a Polyploid Organism

The WB isolate of G. intestinalis contained <0.01% heterozygosity as
estimated from genomic reads [62]. Asexual organisms with diploid or higher
genome ploidy would be expected to accumulate extensive genomic heterozygosity,
i.e. single nucleotide polymorphisms between homologous nucleotide
sites. The phenomenon is known as the Meselson effect and has been observed
in bdelloid rotifers [75]. However, not all asexual organisms display high
levels of heterozygosity. One prominent example is asexual lineages of Daphnia,
which reduce heterozygosity via ameiotic recombination [76].
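
How a heterozygosity figure of this kind can be estimated from genomic reads is sketched below. This is an illustration under stated assumptions and not the procedure used in [62]: each genomic site is represented by the string of bases observed in the reads covering it, a site is counted as heterozygous when the second most frequent base reaches a minimum fraction of the reads, and sites with too few reads are skipped. The function name and both thresholds are hypothetical.

```python
from collections import Counter


def heterozygosity(columns, min_depth=10, min_minor_fraction=0.2):
    """Fraction of sufficiently covered sites that carry two alleles.

    `columns` holds one string per genomic site with the bases seen in the
    reads covering that site.
    """
    callable_sites = 0
    het_sites = 0
    for bases in columns:
        if len(bases) < min_depth:
            continue  # too few reads to call this site either way
        callable_sites += 1
        counts = Counter(bases.upper())
        if len(counts) >= 2:
            second_count = counts.most_common(2)[1][1]
            # A rare second base is more likely a sequence error than an allele.
            if second_count / len(bases) >= min_minor_fraction:
                het_sites += 1
    return het_sites / callable_sites if callable_sites else 0.0


# Hypothetical read columns: homozygous, heterozygous, likely error, low depth
columns = ["AAAAAAAAAA", "AAAAAGGGGG", "AAAAAAAAAC", "AAA"]
print(heterozygosity(columns))  # one of three callable sites is heterozygous
```

In a polyploid organism an allele may be present in only one of several genome copies, so a fixed minor-allele threshold such as the one assumed here can miss true variants or admit errors; the choice of threshold directly affects the estimate.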



Poxleitner et al. used an episomal plasmid to demonstrate genetic
exchange between nuclei of cysts, a process named diplomixis [77], which may
partially explain how the parasite maintains low heterozygosity. Carpenter
et al. showed that cyst formation occurs from a single trophozoite and not
by fusion of two trophozoites [78]. The authors of the former study also
concluded that nuclear sorting, i.e. each daughter cell receives a pair of identical
nuclei, is not likely to be a mechanism by which G. intestinalis reduces
heterozygosity. The precise mechanism likely involves gene conversion and/or
homologous recombination and is yet to be described.



2.3.6 Promiscuous Transcription due to Loose Transcriptional
Regulation?

G. intestinalis contains two nuclei with an equal amount of DNA, which has
been shown by DAPI staining [60]. Uridine incorporation into RNA showed
that both nuclei are transcriptionally active [60]. Recent data indicate that
the two nuclei may not be completely identical: (i) Benchimol et al. showed
that the two nuclei differ in nuclear pore number and distribution [79]; (ii)
Tumova et al. reported that the nuclei differ in both number and size of
chromosomes [80]; (iii) a microRNA precursor was found in only one nucleus
[81]; and (iv) Yang et al. reported allele-specific expression of one vsp [82].

Compared with other well-studied eukaryotes, the transcriptional
apparatus is simple: 21 of the 28 eukaryotic RNA polymerase polypeptides
are present, but only 4 of the 12 general transcription initiation factors [83].
Sequencing of cDNA clones has found short 5' and 3' untranslated regions,
sometimes only a few nucleotides [19]. Transcriptome profiling using Serial
Analysis of Gene Expression (SAGE) and microarrays has uncovered a limited
set of differentially expressed genes [84] [85]. Current data on transcription
in G. intestinalis indicate loose regulation at the transcriptional level. Only
a few regulatory promoter elements have been discovered, mainly for
developmental genes [86] [87]. Intergenic distances are short (the median is 103
bp) [62], leaving little space for regulatory elements. Analysis of promoters
has failed to reveal shared motifs or regulatory elements, and suggested
that promoters are degenerate [88] [89] [62]. An AT-rich sequence of 8 bp was
found to be sufficient to drive transcription [90]. Hence, AT-richness appears
to be the only prerequisite to initiate transcription, possibly explaining the
abundance of pervasive transcription in this organism [91]. One consequence
of this organization is bidirectional promoters, which contribute to pervasive
transcription [91].

Drosha and Exportin-5 are two essential components of the microRNA-processing
pathway, both of which are missing in G. intestinalis [92]. However, the
parasite has Dicer and Argonaute homologs. G. intestinalis Dicer has been
cloned and shown to produce RNA fragments of 25 to 27 nucleotides in vitro
[93], although with lower affinity for its small RNA products compared with
the human homolog. A recent study used antisense-ribozyme RNA in
giardiavirus-infected trophozoites to knock down expression of the mRNA
encoding the Argonaute protein [92]. The authors found that knockdown of
Argonaute mRNA inhibited trophozoite replication, and concluded that Argonaute
has an important role in the parasite. Moreover, the same study found a
snoRNA-derived small RNA of 26 nt produced by Dicer, and localized it to the
cytoplasm. Target sites of these small RNAs were identified in vsp genes. An
independent study similarly reported vsp regulation by RNA interference [72].
Altogether, the RNA interference (RNAi) apparatus does not seem to be
completely analogous to that found in metazoans, which is also supported by
the fact that giardiavirus (a double-stranded RNA virus) can replicate in
certain strains of the parasite [94].

2.4 Trypanosoma cruzi: A Pathogen Transmitted by
Blood-sucking Insects and Cause of Systemic Illness

Trypanosoma cruzi (T. cruzi) is a protozoan parasite and causative agent of
Chagas disease (American trypanosomiasis), both discovered and described by
Carlos Chagas in 1909 [95]. Chagas disease is a zoonosis, affecting ~8 million
people mainly in rural and peri-urban areas of Mexico, Central America and
South America [96]. A wide range of insect vectors facilitate transmission of
T. cruzi, and the endemic range of both the parasite and its vectors stretches
from the southern United States to Argentinean Patagonia [97]. Chagas disease
also occurs in non-endemic countries, because of migratory influx from endemic
countries in Latin America [98]. More than 300,000 individuals are currently
estimated to carry the infection in the United States and >80,000 in Europe
[97]. However, since the natural vectors are not present, the disease is
mainly confined to the infected individuals and accidental transmission via
blood transfusion or organ transplant.



Chagas disease is a chronic and systemic illness [97]. The parasite has likely
existed among animals in the Americas for millions of years, as concluded from
its wide geographical distribution and host range [99]. Another line of
evidence comes from observations of pathology: domestic animals and humans
often display pathology from the infection, in contrast to wild animals, in
which pathology has not been recorded. This suggests that wild animals and the
parasite co-evolved, which led to attenuated virulence. T. cruzi DNA recovered
from 9000-year-old mummies indicates that the disease has been troubling
humans for a long time [100]. In humans, the acute phase of Chagas disease is
often asymptomatic and lasts from weeks to months. If symptoms do occur during
this phase, they are benign (fever, swollen lymph glands and occasionally a
local inflammatory reaction at the bite site) [97]. During the acute phase, T.
cruzi can infect any nucleated cell of the host and parasites may be found in
the blood. The immune system eventually reduces the parasitaemia, but does not
clear the infection completely, and the individual can remain asymptomatic for
years or decades. At the chronic stage, parasites are still present in
specific tissues, for example, muscle or enteric ganglia. Several years after
entering the chronic stage, 20-30% of the individuals develop irreversible
lesions of the heart, colon and/or oesophagus, and in some cases the
peripheral nervous system [97]. The main lesions of Chagas disease are focal
and extensive myocardial fibrosis, driven by a latent inflammatory response.
Cardiomyopathy is manifested by cardiac arrhythmias, apical aneurysm,
congestive heart failure, thromboembolism and sudden cardiac arrest [97].
Chagasic pathology of the colon and oesophagus is referred to as the digestive
form of the disease, and is caused by destruction of enteric ganglia. The end
result is segmental paralysis of the colon and/or oesophagus. Digestive Chagas
appears to be more prevalent south of the Amazon basin (Argentina, Brazil,
Bolivia and Chile) [99].

Treatment of Chagas disease is currently limited to two drugs introduced
over 40 years ago, nifurtimox and benznidazole [97]. The drugs appear to
have variable efficacy, require long treatment periods (60 to 90 days) and
have no effect on advanced Chagas disease. The drugs can give rise to severe
side effects, including kidney and liver failure. Nifurtimox can also cause
neurological disturbances and seizures. Despite many promising new drug
targets and leads, for example posaconazole [101], few candidates have moved
beyond the discovery phase, likely due to limited funding from companies and
governments.

2.4.1 T. cruzi is Transmitted by a Wide Range of Insect Vectors 

The parasite, T. cruzi, belongs to the kinetoplastid group, which also
includes the human parasites Leishmania spp. and T. brucei. T. brucei is
indigenous to Africa and Leishmania spp. can be found worldwide.
Kinetoplastid parasites exhibit some unusual molecular processes, such as RNA
editing [102], trans-splicing [103] and antigenic variation [104]. T. cruzi is
transmitted by several different species of insect vectors, mainly of the
genera Triatoma, Panstrongylus and Rhodnius (Hemiptera; Reduviidae). The first
entomological description of a triatomine, Triatoma rubrofasciata, was made as
early as 1773 by the scientist De Geer [105]. The most important vectors for
human transmission are Triatoma infestans, Rhodnius prolixus and Triatoma
dimidiata [97]. Vector species differ in regional distribution; for example,
T. infestans has been the most important vector of sub-Amazonian regions,
whereas Rhodnius prolixus is the predominant vector in northern Latin America.
The insects are hematophagous bugs that feed on vertebrate blood, causing
transmission of the parasite. Insects often live in cracks of poor quality
rural homes or huts, and emerge at night, biting people near the eye or mouth.
Many different mammalian hosts act as parasite reservoirs and thereby sustain
the transmission cycle of T. cruzi. More than 150 species of wild (e.g.
armadillos, opossums and raccoons) and domestic (e.g. dogs,






cats and guinea pigs) animals can act as T. cruzi reservoirs. The disease can
also be transmitted via non-vectorial mechanisms, including blood transfusion
[106] and organ transplant [107] as well as congenital transfer from mother to
fetus [108, 109]. Oral transmission is possible via ingested food or liquid
[110] and is generally associated with massive parasite proliferation, with
severe and acute clinical manifestations and a high rate of mortality [111].
Oral outbreaks are rare but have been documented [112]. Even more rarely,
individuals have become infected by accidents in the laboratory [113].
Regional differences in disease severity have often been suspected and may be
due to parasite genotype, host genetics, transmission cycles and control
programs.



Control measures for Chagas disease include chemical insect control,
improvement of housing conditions and education. Blood can be screened before
transfusion using serological tests. Most but not all Latin American countries
have implemented mandatory serology tests for blood donors. Vector control
programs involving spraying of insecticides on houses and buildings have
largely been successful. For example, the "Southern Cone Initiative" has
reduced Chagas transmission rates in the Southern Cone of the continent by
disrupting transmission via the vector Triatoma infestans [114].

It is possible that Charles Darwin (1809-1882) contracted Chagas disease
during his journey to the Americas, as suggested from descriptions of a
specific incident where he was bitten by reduviid insects and from some of the
symptoms he suffered later in life [115].



2.4.2 The Complex Life Cycle of T. cruzi 

T. cruzi has a relatively complicated life cycle, with several distinct
morphological stages, vector species and mammalian hosts (Figure [5]). The
life cycle begins in a T. cruzi reservoir, which is an infected animal or
human [116]. Infected animals or humans have circulating parasites in the
bloodstream. Reduviid insects consume a blood meal from the mammalian
reservoir, taking up a population of T. cruzi trypomastigotes. Inside the
insect, trypomastigotes pass into the midgut and differentiate to amastigote
forms. Amastigotes are 3-5 µm in diameter, proliferate and transform into
epimastigotes in the midgut of the insect. There is no evidence that the
parasite is harmful to the insect. Epimastigotes also proliferate, and move to
the hindgut, where they transform to metacyclic trypomastigotes.
Metacyclogenesis may be triggered by substrate interaction of the flagellum
[117]. Metacyclic trypomastigotes are then excreted via insect feces, and
infection can occur if feces come into contact with the bite wound or mucosal
membranes. T. cruzi uses its flagellum to move into the mammalian host.



[Figure 5. Panel labels: Triatomine insect stages; Human stages. Blood meal
and infection via insect feces. Metacyclic trypomastigotes penetrate human
cells at the bite site; inside cells they transform into amastigotes.
Amastigotes multiply by binary fission in the cells. Intracellular amastigotes
transform into trypomastigotes, then burst out of the cell and enter the
bloodstream.]

Figure 5: The T. cruzi life cycle. Image credit: U.S. Centers for Disease
Control and Prevention.



Parasites then invade host cells via a mechanism involving the cytoskeleton
and host cell lysosomes [118, 119]. Parasites are taken up by the lysosomes,
and subsequent acidification inside the lysosomes activates parasite-secreted
porin-like molecules that facilitate escape from the vacuole [120]. Once
inside the cytoplasm, parasites differentiate into amastigotes and begin to
proliferate, forming "pseudocysts," and eventually turn into trypomastigotes
again. Pseudocysts burst due to the parasite load, releasing large numbers of
parasites into the bloodstream, where they can infect new cells or be ingested
by reduviid insects.



2.4.3 The Population Structure of T. cruzi is Wide, Complex and 
Contains Signatures of Ancestral Hybridization 

The parasite causing Chagas disease, Trypanosoma cruzi sensu stricto (s.s.),
is the type species of the subgenus Schizotrypanum. In addition to T. cruzi
s.s., the Schizotrypanum subgenus harbors approximately half a dozen other
trypanosome species, often referred to as T. cruzi-like species. Most T.
cruzi-like species are restricted to bats (order Chiroptera), and are
morphologically difficult to discriminate [121]. One of these bat-restricted
organisms is Trypanosoma cruzi marinkellei (T. c. marinkellei), which was
first characterized by Baker et al. in 1978 [122]. T. c. marinkellei is
regarded as a subspecies of T. cruzi and is indigenous to South and Central
American bats. The human infective lineage, T. cruzi s.s., should therefore be
referred to as the nominate subspecies Trypanosoma cruzi cruzi. However, in
this thesis the human infective parasite is simply referred to as T. cruzi, or
T. cruzi s.s. when applicable. Lewis et al. estimated the divergence of T.
cruzi s.s. and T. c. marinkellei at 6.51 million years ago using the gpi gene
[123].

T. cruzi propagates predominantly via binary fission. However, Gaunt et al.
created hybrid clones of distinct T. cruzi s.s. strains (in vitro), and
thereby showed that T. cruzi s.s. has an extant capacity for genetic exchange
[48]. The mechanism of genetic exchange is somewhat unusual, involving fusion
of cells followed by genomic erosion to a diploid genome. Sexual events have
likely shaped the current population structure of T. cruzi s.s., but have been
sufficiently rare to allow clonal propagation during long periods of time
[124]. T. cruzi s.s. is currently partitioned into six discrete typing units
(DTUs; TcI-TcVI; Figure [8]). Two of these, TcV and TcVI, are the result of
ancestral hybridization events between TcII and TcIII [125]. In addition to
the six DTUs, there is one genotype identified only in Brazilian bats, TcBat
[126]. The genetic heterogeneity of T. cruzi s.s. may explain differential
clinical manifestations. The null hypothesis of neutral DTU subdivision with
respect to Chagas disease severity can safely be rejected, but there is
currently no definite correlation between DTU and disease outcome.



2.4.4 Was the Ancestor of T. cruzi sensu stricto a Bat Trypanosome? 

South American trypanosomes diverged from African trypanosomes after the
break-up of Gondwanaland, and evolved parasitism independently [127, 128,
129]. While it is impossible to precisely date the divergence, estimates based
on ribosomal RNA genes and biogeographical data suggest that it occurred 90 to
100 million years before present [128, 130]. This implies that the divergence
of the present day T. cruzi and T. brucei predated the origins of insect
vectors and placental mammals. A phylogeny of the closest known relatives of
T. cruzi is shown in Figure [6]. T. brucei occurs exclusively in Africa and is
transmitted by tsetse flies. In contrast to South American trypanosomes, T.
brucei could have co-evolved with primates and hominids for many millions of
years. On the other hand, human presence in the Americas stretches no further
than ~30,000 to 40,000 years before present [131]. Hence, T. cruzi can only
have been in contact with humans for this period of time. An increase in human
agricultural activities ~10,000 years ago likely brought about the first
contact of the parasite with humans, and at that time most infections were
likely to have been accidental. Parasite infections then gradually became more
prevalent when human dwellings became infested with the insect as an extension
of its natural habitat. Possibly, deforestation and an increase of agriculture
facilitated the spread of insect vectors and thereby the disease.

Figure 6: Maximum likelihood phylogeny of trypanosomatid species based on the
gGAPDH gene. Sequences were aligned with ClustalW2 and the phylogeny was
inferred with MEGA5 [56] using the Tamura-Nei model; 1000 bootstrap replicates
were performed. The scale bar refers to the number of substitutions per site.
Bootstrap values are shown close to the branches. T. brucei was used as an
outgroup to support the phylogeny. Geography and hosts are indicated to the
right. Accession numbers of the included sequences: JN040964, GQ140362,
GQ140360, GQ140358, GQ140364, AJ620283, AJ620267, Tb927.6.4280 (GeneDB).



Current data indicate a strong association between bats and T. cruzi-like
flagellates, suggesting a long period of shared evolutionary history. Thomas
et al. reported that hematophagous arthropods might act as vectors for the
transmission of T. cruzi-like species among bats [132]. Many of the infected
bats are insectivorous, suggesting that bats become infected upon feeding on
insect vectors. Hamilton et al. proposed the hypothesis that the ancestor of
T. cruzi s.s. was a bat trypanosome, which made multiple jumps to terrestrial
mammals [133]. Thus, the broad mammalian host range of T. cruzi s.s. may be a
characteristic derived from a bat-restricted trypanosome. The following
observations support the "bat-seeding hypothesis" of T. cruzi s.s.: (i) the
subgenus Schizotrypanum is dominated by bat-associated parasites; (ii) the
closest known relative of T. cruzi s.s. is T. c. marinkellei, found only in
South and Central American bats; (iii) Lima et al. reported a new
trypanosomatid species of African bats, Trypanosoma erneyi, forming a clade
within Schizotrypanum with T. c. marinkellei as a sister clade [134] (Figure
[6]); (iv) the present day T. cruzi s.s. has been found in bats, albeit at low
prevalence [135, 136, 137]; (v) Marcili et al. recently reported a new
genotype of T. cruzi s.s. (TcBat) that is only found in bats [126], although
little is known about this genotype and conclusions about its host specificity
may reflect insufficient sampling; and (vi) compared with T. brucei, the
present day population structure of T. cruzi is wider and more complex,
consistent with a dispersion facilitated by bats. One study recently reported
new strains of the bat-restricted trypanosome T. dionisii in British bats,
suggesting natural movement of bats between the Old and New World [138].



'Ecological host switching' describes a process where parasites may acquire
new hosts or expand their host range without evolving new host utilization
capabilities [139]. The process of ecological host switching has been proposed
as the mechanism by which the ancestral T. cruzi lineage colonized terrestrial
mammals [140].



2.4.5 An Unusual Amount of Genomic Redundancy 



T. cruzi strains exhibit extensive variation in DNA content [141, 142, 143,
144], illustrating the diversity of this species. The genome sequence of the
T. cruzi clone CL Brener (TcVI) has been determined [145]. The CL Brener clone
is highly virulent and was isolated from the blood of mice infected with the
parental strain CL [146]. The CL strain was originally isolated from Triatoma
infestans in 1963 [147]. Several factors complicated genome finishing and
resulted in a genome sequence of lower quality than those of T. brucei [148]
and Leishmania major [149], which were sequenced in parallel: (i) the CL
Brener clone was a genetic hybrid, i.e. it consisted of two haplotypes that
had diverged by 3-4%, referred to as non-Esmeraldo-like and Esmeraldo-like;
(ii) the genome was enriched with sequence repeats of various types,
comprising ~50% of the genome; and (iii) the karyotype of the CL Brener strain
was found to be complex, consisting of at least 80 chromosomes of various
sizes [150]. Arner et al. realigned the shotgun data from the genome project
with the assembly and showed that many genes existed in almost identical
copies [151]. This indicated that copy number variation must have been
introduced relatively recently, since alleles have had little time to diverge.
Weatherly et al. organized contigs and scaffolds from the genome project into
longer chromosome-wide sequences [150], although 30-40% of the genome is still
unresolved. The majority of excluded genes belong to gene families. Most of
the sequence repeats that went into the original genome draft [145] were genes
encoding surface antigens, such as trans-sialidases (TSs), mucins,
mucin-associated surface proteins (MASPs), dispersed gene family 1 (DGF-1) and
GP63 peptidases, as well as retrotransposons. Many of the repeated genes exist
as pseudogenes in the genome. One prominent example is the TS family,
containing at least 693 pseudogenes in the draft genome sequence, and likely
many more alleles that fell outside the assembly [145]. Many of the surface
proteins are glycosylated, and they cover and shield the parasite from the
host immune system. Some TS proteins transfer sialic acid from the host to
mucins [152], and have been proposed as drug targets [153]. Minning et al.
used comparative genomic hybridization to sample multiple independent strains
and found widespread copy number variation and whole chromosome aneuploidies
[154].



Most T. cruzi genes are densely packed into polycistronic transcription units
(PTUs), which are separated by strand switch regions [103]. RNA polymerase
(pol) II drives transcription in two different directions, resulting in
polycistronic pre-mRNAs. The pre-mRNA from the PTU is subsequently matured via
trans-splicing and polyadenylation to mRNAs. Trans-splicing involves the
ligation of a 39-nt spliced-leader sequence to the 5' end of transcripts.
Since the life cycle is complex, the parasite needs to regulate its gene
expression in order to adapt to different hosts and local environments. While
RNA pol II is responsible for the overall transcription of PTUs, the genome
lacks defined promoter elements. This organization suggests that individual
genes are not regulated at the transcriptional level; rather it is assumed
that gene expression is regulated at the post-transcriptional level. However,
recent evidence indicates that epigenetic mechanisms may be involved: (i)
Respuela et al. provided evidence of acetylation and methylation at divergent
(head to head) strand switch regions, but did not find these patterns at
convergent (tail to tail) strand switch regions or within PTUs [155]; and (ii)
Ekanayake et al. reported the presence of the glycosylated thymine base
(β-D-glucosyl-hydroxymethyluracil or base J) close to PTUs [156, 157]. Base J
is rare outside of the Kinetoplastida; it has only been found in Diplonema (a
small phagotrophic marine flagellate) [158] and Euglena gracilis (a
unicellular alga and close relative of kinetoplastids) [159].



T. cruzi also contains mitochondrial DNA, which is present in a disk-like
structure known as the kinetoplast. Kinetoplast DNA consists of maxicircles
and minicircles, circular molecules that are interlocked in a complex network.
Their precise size depends on the strain, but maxicircles typically occur in
20-30 copies per cell and range in size from 35 to 50 kb. Minicircles occur in
thousands of copies per cell and are approximately 0.8 to 1.6 kb. Transcripts
of minicircles and maxicircles are involved in uridine insertion/deletion RNA
editing [160]. Some evidence indicates that minicircles can integrate into the
host genome [161, 162], and therefore have the potential to evoke immune
responses and alter host gene expression.

2.4.6 Lack of RNA interference in T. cruzi but not in T. brucei 

Small RNAs are non-coding RNA molecules that are either functional or
non-functional. In many eukaryotes, functional small RNAs are abundant and
partitioned into many different classes, e.g. microRNAs, short interfering
RNAs, piwi-interacting RNAs and several others [163]. RNA interference (RNAi)
is a gene silencing process, deeply rooted in eukaryotes, that mediates
silencing via RNA-induced degradation of target transcripts. At the heart of
RNAi lies the Argonaute/Piwi protein complex, which exerts
post-transcriptional gene silencing. In the kinetoplastids, the presence of
RNAi is variable [164]. Ngo et al. showed as early as 1998 that T. brucei
possesses functional RNAi [165, 166], but RNAi is missing or non-functional in
T. cruzi [167]. In Leishmania spp. the situation is similar: RNAi is present
in some species (e.g. L. braziliensis, L. panamensis, L. guyanensis) but not
in others (e.g. L. major, L. donovani, L. mexicana) [164]. These data suggest
that RNAi was lost twice in the evolution of the kinetoplastids. Future
research will be needed to determine whether T. cruzi-like organisms are also
RNAi-negative. The cause of RNAi loss can only be speculated upon, but it is
possible that loss of active mobile elements freed the parasite from the need
to keep RNAi to mitigate the effects of mobile elements. It is also possible
that loss of RNAi was selected for, i.e. if the loss altered gene expression
so that it affected virulence or other properties.



While small RNAs of T. brucei have been relatively well studied [168, 169,
170], T. cruzi has received less attention on the subject. Analysis of the
genome sequence has shown that T. cruzi lacks Dicer and Argonaute homologs
[145]. The genome does contain a gene encoding an Argonaute/Piwi-like protein,
but apparently without a recognizable PAZ domain [171]. The significance of
this gene is uncertain. However, the lack of functional RNAi does not exclude
the existence of small RNAs that may exert effects through other pathways.
Small RNA species of T. cruzi containing the spliced leader mini-exon and
other small RNAs were reported early [172, 173]. Garcia-Silva et al. reported
a population of tRNA-derived small RNAs that localized to cytoplasmic
granules, and increased during nutritional stress [174].

3 Aims

... we have come to the edge of a world of which we 
have no experience, and where all our preconceptions 
must be recast. 

- D'Arcy Wentworth Thompson (1917), biologist. 



The aim of the thesis was to further understand intraspecific genomic 
variation and transcriptional features of the protozoan parasites Giardia in- 
testinalis and Trypanosoma cruzi. 

3.1 Specific aims 
Paper 1 

Genome sequence comparison of the two human-infecting genotypes of Giar- 
dia intestinalis (A and B). 

Paper 2 

Identify genomic features that distinguish a non-human associated genotype 
of Giardia intestinalis (E) from two human-infective genotypes (A and B). 

Paper 3 

Genome comparison of Trypanosoma cruzi sensu stricto with the bat-restricted 
subspecies T. cruzi marinkellei. 

Paper 4 

Characterization of the short, non-coding transcriptome of Trypanosoma cruzi 
in order to understand if the parasite has functional classes of small RNAs. 

Paper 5 

Characterization of the transcriptome of Giardia intestinalis at single nu- 
cleotide resolution and investigation of gene expression divergence. 



4 Present investigation

Science is what we have learned about how not to fool ourselves about 
the way the world is. 

- R.P. Feynman (1918 - 1988), theoretical physicist. 



4.1 Paper i and ii: Genome Comparison of Three Distinct
Isolates of Giardia intestinalis

In Paper i and ii we performed genome comparisons of three distinct isolates
(strains) of G. intestinalis, representing genotypes (syn. assemblages) A, B
and E (see Figure [4] for a phylogeny of genotypes A to G). While A and B
infect humans, the E genotype is only associated with hoofed animals. The
representative isolates are summarized in Table [1].



Table 1: Summary of compared isolates

Isolate   Assemblage   Host    Country       Ref. (a)   Accession (b)
WB        AI           Human   Afghanistan   175        AACB00000000.1 (c)
GS        B            Human   U.S.A.        176        ACGJ00000000.1
P15       E            Pig     Czech Rep.    177        ACVC00000000.1

(a) Original description of the isolate.
(b) NCBI GenBank accession number of the genomic data.
(c) An updated record (AACB00000000.2) is now available.



Morrison et al. described the genome of the WB isolate using Sanger sequencing
[62]. In paper i and ii we sequenced the genomes of GS and P15 using Roche/454
sequencing (the former with FLX chemistry and the latter mainly with TIT
chemistry; see [178] for an overview of the technology). Sequence assembly was
performed de novo using the assembly software MIRA (Chevreux B, unpublished),
and contiguous sequences (contigs) were then improved using targeted Sanger
sequencing. Assembly and finishing resulted in 2,931 (N50 34,141 bp) and 820
(N50 71,261 bp) contigs of the GS and P15 genomes respectively. The assemblies
were more fragmented than that of WB, but still represented the complete
genomes of GS and P15, as confirmed by analysis of non-assembled reads and
assembly sizes (11,001,532 bp of GS and 11,522,052 bp of P15). The P15
assembly was slightly more contiguous, due to the slightly longer read lengths
of the TIT chemistry.
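
The N50 values quoted above summarize assembly contiguity: N50 is the length
of the contig at which contigs of that length or longer together cover at
least half of the total assembly. A minimal Python sketch of the calculation
follows. The function name and the toy contig lengths are illustrative
assumptions; they are not taken from the GS or P15 assemblies or from MIRA.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half of the
    # assembly is covered; the contig reached at that point is the N50.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy lengths (hypothetical, not the real contigs).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

A higher N50 for a similar total assembly size indicates fewer, longer
contigs, which is how the contiguity of the two assemblies is compared above.
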

Contigs of the two genomes were annotated using a two-tier approach: (i)
automatic transfer of gene models from the reference strain [62]; followed by
(ii) manual curation. The reference genome (WB) was downloaded from the
database GiardiaDB [179], which is part of the EupathDB initiative to
integrate genome-wide data sets from eukaryotic pathogens. Open reading frames
(ORFs) were extracted from GS and P15, and annotated using reciprocal best
BLAST hits against genes of WB. Gene models were then manually inspected and
unlikely gene models were discarded. Cross-genome comparisons allowed
selection of the most conserved and therefore most likely start codon. The
majority of the nek and p21.1 families were assigned orthologs, but this was
not the case for most vsp and hcmp genes. This suggested that vsps and hcmps
have undergone lineage-specific diversification events, for example positive
selection or recombination.
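
Ortholog assignment by reciprocal best BLAST hits, as used above, can be
sketched in Python as follows. The sketch assumes that BLAST results have
already been parsed into (query, subject, bit score) tuples; the function
names and gene identifiers are hypothetical, and the thresholds and manual
curation applied in the papers are not reproduced here.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hit tables with made-up identifiers and scores.
gs_vs_wb = [("gs1", "wb1", 200), ("gs1", "wb2", 150),
            ("gs2", "wb2", 90), ("gs3", "wb2", 300)]
wb_vs_gs = [("wb1", "gs1", 210), ("wb2", "gs3", 290), ("wb2", "gs2", 80)]
print(reciprocal_best_hits(gs_vs_wb, wb_vs_gs))
```

In the toy data, gs2 is left without an ortholog because its best hit (wb2)
has a better match elsewhere, which is the kind of outcome expected for
rapidly diversifying families.
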





Figure 7: Heterozygous loci of the GS isolate counted in sliding windows along a
genomic contig. Heterozygous loci were determined using alignments of Roche/454
reads. Windows were 1000 bp in size and overlapped by 50%. The contig
has the accession number ACGJ01002920. (Y-axis) Heterozygous loci per window.
(X-axis) Position along the sequence.
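
The window scheme described in the caption (1000 bp windows with 50% overlap,
i.e. a step of 500 bp) can be sketched in Python as follows. This is an
illustration under stated assumptions, not the script used to produce Figure
7: the function name, the 1-based inclusive coordinates and the toy positions
are hypothetical.

```python
def window_counts(positions, contig_length, window=1000, step=500):
    """Count heterozygous loci in overlapping windows along a contig.

    positions: 1-based coordinates of heterozygous loci.
    Returns a list of (start, end, count) with inclusive coordinates.
    """
    counts = []
    start = 1
    while start <= contig_length:
        end = min(start + window - 1, contig_length)
        n = sum(1 for p in positions if start <= p <= end)
        counts.append((start, end, n))
        if end == contig_length:
            break  # the last window reached the end of the contig
        start += step
    return counts


# Toy example: five loci on a 3000 bp contig (made-up positions).
for start, end, n in window_counts([100, 600, 900, 1400, 2600], 3000):
    print(start, end, n)
```

Because consecutive windows share half of their bases, a locus is usually
counted in two windows, which smooths the profile along the contig.
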



Genomic heterozygosity was almost absent in the WB isolate (<0.01%)
[62], and P15 likewise had extremely few heterozygous loci. Conversely,
GS exhibited extensive genomic heterozygosity, estimated genome-wide at
~0.5%. The data did not per se reveal whether the observed
differences were located in the same or different nuclei. Figure 7 shows how
heterozygosity varies along a contig representing 0.83% of the genome. About
half of the detected heterozygous loci were located in coding sequences, and
38% changed the encoded amino acid (non-synonymous changes). As expected,
there was a strong bias toward transitions. One implication of a heterozygous
genome is the potential to encode additional protein isoforms. Interestingly,
heterozygous loci were frequently clustered and separated by long homozygous
regions. The presence of heterozygous loci in clusters, rather than
homogeneously dispersed over the whole genome, argues against the Meselson
effect, i.e. the phenomenon where asexual organisms accumulate heterozygosity
in the absence of sexual or ameiotic recombination. A similar mosaic pattern
was observed in Naegleria gruberi [180]. As seen in Figure 7, heterozygous loci
are more frequent on the left half of this genomic segment. One interpretation
of this pattern would be that homogenization via gene conversion has
partially taken place. It is also possible that other processes have contributed
to the heterozygote deficit, such as the Wahlund effect or selfing/homogamy
[181]. These indirect observations thus suggest that GS has undergone a more
recent sexual event as compared with WB and P15. The two latter isolates may
have longer asexual histories.
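
The windowed counting behind Figure 7 (1000 bp windows overlapping by 50%)
can be sketched as follows. This is a minimal illustration under stated
assumptions: positions are 0-based, only full-length windows are reported,
and the function name and example positions are hypothetical.

```python
from typing import List, Tuple


def windowed_counts(positions: List[int], contig_length: int,
                    window: int = 1000, overlap: float = 0.5) -> List[Tuple[int, int]]:
    """Count positions in overlapping windows; returns (window start, count)."""
    step = int(window * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    counts = []
    # Only windows that fit entirely inside the contig are reported.
    for start in range(0, max(contig_length - window, 0) + 1, step):
        end = start + window
        counts.append((start, sum(start <= p < end for p in positions)))
    return counts


# Hypothetical heterozygous positions on a 3000 bp contig.
het_positions = [100, 450, 600, 990, 1200, 2900]
profile = windowed_counts(het_positions, contig_length=3000)
```

Because consecutive windows share half of their bases, a locus is usually
counted in two windows, which smooths the profile but means that the window
counts do not sum to the total number of loci.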



4.1.1 The Core Genome and Isolate-specific Genes 

The shared and non-shared gene content of the three isolates was investigated 
using reciprocal BLAST searches. The analysis revealed that the core gene 
content could be defined by 4557 genes, which excluded isolate-specific genes 
and vsps. The core gene content comprised housekeeping genes with
homology to other eukaryotes, and G. intestinalis-specific genes. Thirty-eight,
thirty-one and five genes were specific for P15, GS and WB respectively. One 
of the P15-specific genes represented an acetyltransferase, and phylogenetic 
analysis indicated a bacterial origin (likely from a bacterial species of the 
group Firmicutes). The donor lineage could not be precisely defined, but may
be one of Lactobacillus, Clostridium, Anaerotruncus or Enterococcus, all of
which are common inhabitants of the gastrointestinal tract. This suggests the
uptake of the gene was relatively recent, and is an example of bacteria-to- 
eukaryote horizontal gene transfer. The GS genome contained several genes 
that were likely transferred from bacteria, one of which is likely an example 
of "dead upon arrival," i.e. it was most likely transferred as a pseudogene. 
At least 96 genes with detectable homology to bacterial genes were conserved 
in the A, B and E genotypes and have attained housekeeping functions. The 
mechanism behind horizontal gene transfer is not determined, but likely
involves multiple successive steps, each of which must be successful in order
for the gene to be integrated into the new genome. A successfully integrated
gene must also not be deleterious to the new host, and must convey a selective
advantage, to become fixed in the population. Frequent exposure to immense
bacterial populations in the intestine has most likely contributed to creating
opportunities for horizontal gene transfers.

4.1.2 Structural Variation and Overall Divergence 

Synteny breaks refer to disruption of gene order, often due to genomic
rearrangements. In both GS and P15 (compared to WB), synteny breaks were
recorded and tended to occur in regions devoid of housekeeping genes. These
regions displayed an atypical nucleotide composition in a sliding-window anal- 
ysis and deviated in GC-content. Rearrangements may have been introduced 
spontaneously without affecting parasite fitness, and may therefore have cir- 
cumvented purifying selection. 

The average amino acid identity between WB and P15 was 90%, 81% 
between P15 and GS and 78% between GS and WB, as measured by comparing 
orthologous genes. The sequence identities recapitulated the relationships
seen in single-gene phylogenies, supporting the accuracy of previous
phylogenetic trees, which are
usually not inferred from genome-wide data. The sequence divergence of P15 
and WB was similar to what is observed between L. major and L. infantum, 
whereas the divergence of GS and WB was similar to that of Theileria parva 
and T. annulata. Hence, the divergence between the studied G. intestinalis 
genotypes is similar to what is observed between distinct species. It can thus 
be argued that the G. intestinalis genotypes should be regarded as separate 
species rather than genotypes. 

The dN/dS ratio (rate of non-synonymous nucleotide substitutions/rate of
synonymous nucleotide substitutions) can be used to indirectly identify
positive selection. The WB vs. GS comparison did not allow calculation of dN/dS
ratios since synonymous changes were saturated (i.e. more than one substitu- 
tion per site). Analysis of dN/dS of WB and P15 indicated, as expected, that 
most of the genome was under purifying selection. Several uncharacterized 
genes were found to exhibit elevated dN/dS ratios (>1), indicating putative 
positive selection. Gene Ontology analysis indicated that five GO categories 
contained genes under positive selection, possibly reflecting co-evolution of 
multiple genes involved in common pathways. Furthermore, SAGE data were 
used to categorize genes into developmental categories. Four developmental 
categories displayed elevated dN/dS ratios, possibly reflecting lineage-specific 
divergence of cellular processes. 
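
The interpretation of dN/dS described above can be summarized in a short
sketch. It assumes that dN and dS have already been estimated by an external
program, and it uses the saturation threshold mentioned in the text (more
than one synonymous substitution per site). The function names and labels
are hypothetical.

```python
from typing import Optional


def omega(dn: float, ds: float, saturation: float = 1.0) -> Optional[float]:
    """Return dN/dS, or None when dS is zero or saturated."""
    if ds <= 0 or ds > saturation:
        return None
    return dn / ds


def selection_label(dn: float, ds: float) -> str:
    """Classify a gene pair by its dN/dS ratio."""
    ratio = omega(dn, ds)
    if ratio is None:
        return "not estimable"
    if ratio > 1:
        return "putative positive selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral"
```

For example, a gene pair with dN = 0.3 and dS = 0.2 would be labeled as
putative positive selection, whereas a pair with dS = 1.6 would be set aside
as saturated, as was the case for the WB vs. GS comparison.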

4.2 Paper iii: Genome Comparison of Trypanosoma cruzi
sensu stricto With the Bat-restricted Subspecies T.
cruzi marinkellei

The genome of the human infective T. cruzi (T. c. cruzi; T. cruzi s.s.) was
compared with that of its closest relative, T. c. marinkellei (Tcm). Two
clones were selected for the comparison: (i) T. cruzi s.s. Sylvio X10, which
was isolated in 1983 from a human male in Para State, Brazil, and has
confirmed pathogenicity [182]. Sylvio X10 is subgrouped into DTU TcI. (ii) Tcm
clone B7, which was originally isolated in 1974 from the bat host Phyllostomus
discolor in Sao Felipe, Bahia, Brazil (M.A. Miles and T.V. Barrett) [122]. Tcm
B7 was not found to be infective in immunocompromised mice, nor did it
provide immunological protection against subsequent challenge with T. cruzi
s.s., suggesting distinct antigenic profiles [122]. Tcm is restricted to South
American bats and has to date not been recovered from humans. The phylogenetic
relationship of T. cruzi s.s. and Tcm is shown in Figure 8.

T. cruzi s.s. Sylvio X10 (SX10) is a non-hybrid strain, and it has a smaller
genome [144] than the previously sequenced T. cruzi s.s. strain CL Brener
(TcVI; Figure 8) [145]. The smaller genome and non-hybrid nature imply
that this clone likely has fewer repetitive sequences. T. cruzi s.s. SX10 is also
evolutionarily distinct from T. cruzi s.s. CL Brener, and therefore creates an
interesting basis for comparison.



Table 2: Summary of compared T. cruzi clones

Species    Subspecies          DTU a   Clone          Host    Ref.
T. cruzi   T. c. cruzi         TcI     Sylvio X10/1   Human   [182]
T. cruzi   T. c. cruzi         TcVI    CL Brener      Human   [147]
T. cruzi   T. c. marinkellei   -       B7             Bat     [122]

a Discrete Typing Unit; genotype.



The study began with Roche/454 and Illumina sequencing of sheared
genomic DNA of T. cruzi s.s. SX10 and Tcm B7. Briefly, Roche/454 and
Illumina reads were assembled separately de novo and the assemblies were
then merged into a non-redundant assembly. The merged assemblies were
subjected to quality enhancements, including scaffolding, gap closure and
homopolymer error correction. The T. cruzi s.s. CL Brener genome was used
for transferring gene models to the new genomes, since the majority of the
gene repertoire was expected to be shared. The genomes were annotated
using a semi-automatic pipeline. Orthologs were identified using best reciprocal
BLAST, and gene models were manually curated. Flow cytometry estimated
the haploid genome sizes of T. cruzi s.s. SX10 and Tcm B7 at 44 and 39 Mb,
respectively. The haploid genome size of T. cruzi s.s. CL Brener has
previously been estimated at 55 Mb [145]. The assembly sizes of the
genomes agreed closely with the experimental measures, which supported
the computational steps.



Figure 8: Nucleotide maximum likelihood phylogenies of T. cruzi s.s. DTU TcI-VI.
The phylogenies were inferred from concatenated sequences encoding the
beta-adaptin and endomembrane proteins. Sequences were aligned with ClustalW2
and trees were inferred with MEGA5 [56] using the Tamura-Nei model. 1000
bootstrap replicates were performed. Scale bars refer to number of
substitutions per site. Numbers close to branches indicate bootstrap support
values. T. brucei (A) and T. c. marinkellei (B) were used as outgroups. Tip
labels in bold indicate genotypes sequenced in the present study. (Blue tip
label) Available reference genome (non-Esmeraldo-like haplotype). (Accession
numbers) of the dataset: HQ859539, HQ859587, HQ859534,
HQ859592, HQ859540, HQ859582, HQ859535, HQ859583, HQ859538, HQ859590,
HQ859543, HQ859585, HQ859541, HQ859593, Tb927.10.8040, Tb11.02.0960.



Heterozygosity in T. cruzi s.s. SX10 and Tcm B7 was estimated at 0.19%
and 0.22%, respectively. Sliding window analysis revealed that heterozygous
sites were often clustered in blocks. The organization of heterozygous loci
therefore resembled that found in T. cruzi s.s. CL Brener, although at much
lower levels. One could speculate that the mosaic structure is a result of gene
conversion. Overall, the heterozygosity levels were similar to those found in
some Leishmania species [183].



4.2.1 Evidence of a Recent Eukaryote-to-Eukaryote Horizontal Gene 
Transfer 

The genomes of T. cruzi s.s. SX10, T. cruzi s.s. CL Brener and Tcm B7 were
systematically searched for gene differences. The search identified a unique
1,662 bp acetyltransferase gene in Tcm B7. The gene will be referred to by its
locus tag MOQ_006101. Sequence analysis revealed that the flanking genomic
regions were present in T. cruzi s.s. SX10 and T. cruzi s.s. CL Brener, but
not the gene itself. MOQ_006101 was not identified in unassembled genomic
reads of T. cruzi s.s. SX10 or T. cruzi s.s. CL Brener, suggesting it is unique
to Tcm B7. Several fragments of VIPER retrotransposons were identified
close to the locus, and RT-qPCR confirmed expression of the gene.

Phylogenetic reconstruction indicated that the closest known homologs
were in algae and plants, suggesting MOQ_006101 was transferred from
another eukaryote. MOQ_006101 showed low sequence identity (30-50%) toward
genes in NCBI GenBank. The absence of intron-exon boundaries suggested
MOQ_006101 was transferred as a spliced mRNA, likely from a species not
represented in GenBank.

The GC-content of the gene was compared with the global GC-content 
in coding sequences. GC-content analysis indicated an unusual composition 
compared to the rest of the genome, strengthening the notion of a transfer 
from another species. It remains to be determined whether MOQ_006101
encodes a functional protein. Moreover, examination of multiple Tcm isolates 
may answer whether MOQ_006101 has been fixed in the species or only in 
a certain lineage. Whether functional or not, the gene itself is interesting 
since it represents an unusual instance of horizontal gene transfer between 
two eukaryotes. 
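
The GC-content comparison can be illustrated with a short sketch. The text
does not state which statistic was used, so the z-score below is an
assumption chosen for illustration, and the function names and toy sequences
are hypothetical.

```python
from statistics import mean, stdev
from typing import List


def gc_content(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_z_score(candidate: str, background: List[str]) -> float:
    """Distance of the candidate GC-content from the background mean,
    expressed in background standard deviations (assumed statistic)."""
    values = [gc_content(seq) for seq in background]
    return (gc_content(candidate) - mean(values)) / stdev(values)


# Hypothetical toy coding sequences; real input would be all annotated CDS.
background_cds = ["GGCC", "GGCA", "GCAT"]
z = gc_z_score("ATAT", background_cds)
```

A large absolute z-score would indicate a composition that is atypical for
the genome, which is the kind of signal interpreted above as support for a
foreign origin.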

4.2.2 Trypanosoma cruzi marinkellei has Capacity to Invade non- 
Bat Epithelial Cells 

The potential of Tcm B7 to invade cell lines other than bat was investigated.
Three common cell lines were selected: (i) Vero cells (kidney cells from African
green monkey); (ii) OK cells (from a North American opossum); and (iii)
Tb1-lu cells (bat lung). The experiments showed that Tcm B7 has retained
the capacity to invade each of the three cell lines, even though the cells were
originally derived from different species. Surprisingly, there was no preference
for bat epithelial cells. Prolonged incubation of the parasite with cells showed
that Tcm B7 is able to replicate intracellularly and the invasion process
appears to be analogous to that of T. cruzi s.s. In conclusion, these data
indicate that the bat-specificity of Tcm is not mediated at the cell-invasion
level.

4.3 Paper iv: The Small RNA Component of the Trypanosoma
cruzi Transcriptome

A cDNA library of small RNAs (sRNAs) of the size range 16 to 61 nucleotides 
(nt) was prepared and sequenced from T. cruzi CL Brener epimastigotes, i.e. 
the insect stage of the parasite. Epimastigotes were used due to the ease of 
obtaining sufficient RNA from this life stage. Other stages often require culti- 
vation together with mammalian cells, which may contaminate the subsequent 
RNA preparation. We also assumed that functional sRNAs, if present at all,
would be present in epimastigotes. The particular size range of 16 to 61
nucleotides was selected to avoid spliced leader RNA, which could otherwise 
cloud the analysis. 

Sequencing generated 582,243 sRNAs, of which 90.7% aligned with the 
genome sequence. We subsequently used the annotation of the genome to as- 
sign sRNAs into relevant categories, i.e. if an sRNA overlapped a tRNA, it was 
assigned to the tRNA category, etc. With respect to the non-Esmeraldo-like 
haplotype, 72.1% of the sRNAs derived from transfer RNAs (tRNA-derived 
small RNAs; tsRNAs). 97.4% of sRNAs were derived from only three classes 
of non-coding RNAs (transfer RNA, ribosomal RNA and small nuclear RNA). 
Only 0.18% of sRNAs were derived from protein-coding genes. 2.42% of sR- 
NAs could not be grouped into any canonical RNA class. A few of the small 
RNAs were experimentally validated using a real time quantitative PCR-based 
assay. In conclusion, the bulk of sRNAs of the 16 to 61 nt size range were 
derived from known classes of non-coding RNAs. 
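
The overlap-based assignment of sRNAs to annotation categories can be
sketched as follows. This is a simplified illustration that assumes a single
contig, 1-based inclusive coordinates and a priority given by the order of
the annotation list; all names and coordinates are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Feature = Tuple[int, int, str]  # (start, end, category); 1-based, inclusive


def categorize(read_start: int, read_end: int, features: List[Feature],
               default: str = "unassigned") -> str:
    """Return the category of the first feature that overlaps the read."""
    for start, end, category in features:
        if read_start <= end and start <= read_end:
            return category
    return default


# Hypothetical annotation and read alignments on a single contig.
annotation = [(100, 172, "tRNA"), (500, 2300, "rRNA"), (3000, 4500, "protein-coding")]
alignments = [(150, 187), (160, 190), (600, 630), (2900, 2950), (4400, 4420)]

tally = Counter(categorize(s, e, annotation) for s, e in alignments)
fractions = {category: n / len(alignments) for category, n in tally.items()}
```

Dividing the per-category counts by the number of aligned sRNAs gives
percentages of the kind reported above.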

The median length of tsRNAs was 38 nt, and 88.9% of them were derived
from the tRNA 3' end. One example of a tRNA-derived small RNA is shown in
Figure 9. 75.3% of the 3'-derived tsRNAs contained the post-transcriptional
'CCA' extension, a hallmark of mature tRNAs. This indicated that most,
albeit not all, tsRNAs were derived from mature tRNAs. However, it is possible
that the 'CCA' extension was lost during sample or library preparation.
If tsRNAs represented degradation products of tRNA turnover, one would
expect a correlation between sRNA copy number and the expression level of
tRNA genes. Analyses using amino acid usage as a proxy for tRNA gene
expression did not find any correlation. The cleavage site of tRNAs was
located inside the anticodon loop, suggesting endonucleolytic cleavage as the
mechanism of generation.

Figure 9: Secondary structure of tRNA-His from T. cruzi CL Brener
(Tc00.1047053508861.10). The secondary structure was predicted using
tRNAscan-SE [187] and visualized using VARNA [188]. The sequence in red
indicates the part cleaved into a small RNA with copy number 41,929 in the
data set.



1.69% of the small RNAs were not derived from known non-coding RNAs. 
These small RNAs were clustered based on their genomic alignment coor- 
dinates. Clustering formed 92 distinct expression loci, of which homology 
searches revealed known non-coding RNAs for 13 loci. The remaining 79 loci 
did not fall into known non-coding RNA classes and had an average length 
of 54 nt. No homology was found in Rfam or GenBank databases. 35 of the 
novel RNAs folded into non-hairpin secondary structures and 18 folded into 
hairpin structures. 1,159 small RNAs were not clustered and had a median 
length of 24 nt. Of these small RNAs, BLAST searches revealed 335 to be 
derived from rRNA and 819 from protein-coding genes. MicroRNA target site 
prediction of the latter population using the "seed region" (nt 2-8) predicted 
the possibility that these regulate certain categories of mammalian genes. It 
is therefore tempting to speculate that the parasite may use small RNAs for 
inter-cell communication, or possibly modification of the gene expression re- 
sponse of the host cell. 

0.13% of small RNAs were derived from repeats, including retroelements. 
In particular the CZAR element contained 446 mapped small RNAs, which 
may indicate putative initiation fragments from the transcription of these 
elements. There was no overrepresentation of antisense reads in any repeat 
class, suggesting that small RNAs have no role in perturbation of mobile 
elements. Searches did not reveal any canonical classes of regulatory sRNAs 
found in metazoans. This is consistent with the lack of RNA interference.

4.4 Paper v: Transcriptome Profiling at Single Nucleotide
Resolution of Diverged G. intestinalis Genotypes
Using Paired-end RNA-Seq



[Figure 10 plot. Top panel: Watson strand. Bottom panel: Crick strand.]

Figure 10: Coverage of RNA-seq sequence reads on three open reading 
frames (logarithmic scale; blue horizontal bars: GL50803_92132, GL50803_92134, 
GL50803_40369). (Watson strand; "plus") Top. (Crick; "minus") Bottom. The 
strand of the open reading frame is shown by arrows ('>' refers to the plus strand 
and '<' refers to the minus strand). 



In this study we performed transcriptome sequencing (RNA-Seq) and
comparison of the polyadenylated transcriptomes of four diverged isolates of G.
intestinalis. The specific aims of the study were to: (i) identify
genotype-specific patterns of gene expression; (ii) confirm and refine genome
annotations; and (iii) identify qualitative transcript properties like 3'
untranslated regions (UTRs). Total polyadenylated RNA was harvested from in
vitro grown trophozoites of the four isolates WB (AI), AS175 (AII), P15 (E)
and GS (B). Sequencing libraries were prepared according to a strand-specific
paired-end protocol and sequenced on Illumina HiSeq 2000 as 2x100-nt reads.
Each library was sequenced as two technical replicates in order to estimate
technical variation. The reproducibility of the RNA-Seq method was determined
using two biological replicates of AS175. RNA-Seq generated 33 to 41 million
read-pairs from each library, which were aligned (mapped) to the corresponding
reference genome (Table 3). Figure 10 shows the strand-specificity of the
obtained data. The aligned RNA-Seq data were then used to calculate digital
gene expression values, formulated as fragments per kilobase of transcript
per million fragments mapped (FPKM). 49 genes were selected for RT-qPCR
validation, which confirmed the accuracy of the measurements. Moreover, a
global comparison was performed with three microarray data sets from
GiardiaDB [179], indicating a moderate but significant correlation of the two
techniques. RNA-Seq measurements did not correlate with SAGE data, which is
not surprising since SAGE is generally not quantitative.
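
The FPKM definition given above translates directly into a formula. The
sketch below is a minimal illustration; the function name and the example
numbers are hypothetical and are not values from the study.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million fragments mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # Equivalent to fragments / (length / 1e3) / (total / 1e6).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2,000 bp transcript in a library
# of 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
```

Normalizing by both transcript length and library size makes values
comparable between genes and between libraries of different sequencing depth.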



Table 3: Summary of studied isolates and generated RNA-Seq data

Isolate   Assemblage   Ref.    #Mapped b    %ORFs c   %Orthologs d
WB        AI           [62]    37,888,422   93.7      99.5
AS175     AII          a       36,878,839   98.3      99.8
P15       E            [189]   36,437,138   96.3      99.6
GS        B            [190]   31,869,806   97.3      99.7

a The genome sequence is not published.
b Reads of this strain that uniquely mapped to the reference genome.
c Percentage of [annotated] ORFs with detectable transcription (FPKM>0.5).
d Percentage of conserved four-way orthologs with detectable transcription.



4.4.1 Low Biological and Technical Variation 

Technical variation consists of measurement imprecision introduced during
library preparation or by the sequencing instrument. The technical replicates
of each sequencing library indicated very low technical variation (Pearson's
r² = 0.99). In this study technical replicates were subjected to the same
library construction procedure, and thus reflected variation of the sequencing
instrument. In contrast, biological replicates reflect both technical and
biological variation. Biological variation may result from uncontrolled
environmental cues or from stochastic variation in gene expression. Biological
variation was low (r² = 0.97), and we therefore concluded that gene
expression measurements were reproducible. The number of genes involved in the
culture-induced biological response was estimated with a χ² test, and found to
be 4% of the total gene content at p=0.01 (AS175). There was no meaningful
way to group the implicated genes.
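
The replicate comparison rests on the squared Pearson correlation of
per-gene expression values. A minimal sketch is given below; the function
name and the replicate values are hypothetical, and any transformation of
the expression values prior to correlation is left out because it is not
specified here.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long vectors with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either vector is constant.
    return cov / sqrt(var_x * var_y)


# Hypothetical expression values for four genes in two replicates.
replicate_1 = [1.0, 2.0, 3.0, 4.0]
replicate_2 = [1.1, 1.9, 3.2, 3.9]
r_squared = pearson_r(replicate_1, replicate_2) ** 2
```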



4.4.2 Gene Expression Levels Recapitulate the Known Phylogeny 

Almost the entire G. intestinalis genome was transcribed to some extent, and
the data provided transcriptional evidence for >99% of the conserved genes
(Table 3). Transcription levels exhibited a wide dynamic range; the fold
difference between the medians of the 5% lowest and 5% highest expressed
genes was 873x. Notably, a transcriptionally silent gene cluster was
identified on chromosome 5 of the WB isolate, encompassing 28 genes in tandem
(Figure 11). The genomic region was 41 kb and contained genes associated with
replication and genes of no known function. Sequence signatures of these
genes were also identified in the GS isolate, suggesting that the region was
acquired prior to the split of the lineages leading to the extant A and B
genotypes. Because the region exhibited higher than average divergence
between A and B, it is possible that it has been subjected to genetic drift
without purifying selection. Saraiya et al. reported that one ORF of this
cluster is transcribed into a microRNA-like small RNA [81], suggesting that
the region may have some functionality, which would explain why it has not
been lost. Nevertheless, the lack of detectable transcription suggests that
silencing may be a selected feature, possibly due to negative effects of
parasitic DNA on fitness.

Figure 11: Sliding window analysis of RNA-Seq coverage on scaffold CH991767
(WB isolate). The zoomed-in region shows a drop in RNA-Seq coverage along a 41
kb region. Sequence coverage was calculated in 500 bp non-overlapping windows.
(Y-axis) log10-transformed coverage sum. (X-axis) Position along scaffold.



Global gene expression profiles of the isolates were compared and
genotype-specific gene expression was identified using a χ² test. A relatively
limited number of genes were differentially expressed (31 to 145 genes). These
numbers likely represent the lower detection bound since the implemented
analysis was conservative. It remains to be investigated how many of the
differentially expressed genes are of functional importance.

The theory of neutral evolution states that most nucleotide changes are
randomly introduced by neutral drift without affecting fitness [192]. In
contrast, positive selection is driven by advantageous mutations and is
generally rarer. It is currently accepted that coding sequences (CDS)
predominantly evolve by neutral evolution followed by purifying selection [193].
Less is known about the modes of evolution acting on gene expression. We 
investigated if the rate of gene expression divergence was correlated with that 
of the CDS. CDS divergence was estimated by the rate of non-synonymous 
nucleotide substitutions (dN), while gene expression divergence was estimated 
by FPKM fold change. The analysis was performed genome-wide on the max- 
imum number of defined ortholog pairs (the precise number varied slightly 
between isolates). Cross-correlations of any two isolates resulted in Pearson's 
r ranging from 0.069 to 0.11, indicating a weak correlation of the two vari- 
ables. Conversely, there was no correlation between the rate of synonymous 
nucleotide substitutions (dS) and gene expression divergence (Pearson's r=0). 
Interestingly, highly expressed genes tended to have a lower rate of divergence
(dN). These data indicated a limited coupling between CDS and gene ex- 
pression divergence, suggesting that random drift has been the predominant 
evolutionary mode of gene expression divergence in G. intestinalis. 

4.4.3 Promiscuous Polyadenylation Sites and Unusual cis-acting 
Signals 

Polyadenylation sites (polyA sites) were precisely mapped using RNA-Seq
data, which allowed global analysis of the G. intestinalis 'polyadenylation
landscape'. PolyA sites were mapped using reads (polyA tags) containing
the mRNA:polyA junction (Table 4). Aligned polyA tags revealed 22,221 to
49,027 distinct polyA sites (depending on the isolate; Table 4).



Table 4: Mapped polyadenylation sites

Isolate   #tags a   #sites b   #PACs c   #transcripts d   Median 3' UTR (nt)
WB        456,928   49,027     7,617     3,884            80
AS175     183,454   37,028     5,028     3,057            100
P15       436,720   51,499     8,037     3,800            83
GS        71,118    22,221     2,624     1,651            85

a PolyA-tags with mapping quality >40.
b Unique polyadenylation sites.
c PolyA sites within 10 nt were clustered into polyadenylation site clusters.
The numbers refer to number of clusters with >4 tags.
d Transcripts with an associated polyA site.



Each of the polyA sites represents the 3' end of an independent transcript,
although not necessarily an mRNA. PolyA sites were found to exhibit
microheterogeneity, i.e. imprecision of a few nucleotides. Microheterogeneity of
polyA sites has been documented in higher eukaryotes and is attributed to
the imprecise nature of the polyadenylation machinery [194, 195]. The
phenomenon is not to be confused with alternative polyadenylation, which is
regulated by distinct polyadenylation signals. To account for this
heterogeneity, polyA sites within 10 nt were clustered into polyadenylation
site clusters (PACs). PACs were assigned to the most likely ORF based on
proximity, i.e. the 5'-most PAC counted from the translational stop codon was
assigned to the ORF. Cloning and 3' rapid amplification of cDNA ends was
performed on 9 genes for validation, confirming the accuracy of the mapped
sites. The median 3' UTR length was found to be 80 nt for WB, and similar for
the other isolates. In comparison, the median 3' UTR length of Saccharomyces
cerevisiae is 104 nt [196]. The G. intestinalis estimate is longer than
earlier estimates (around 30 nt), which were based on a small set of highly
expressed genes. Several microRNAs have been identified in G. intestinalis but
searches for target sequences were only done within the first 50 bp from the
stop codons. These results suggest that regulation of gene expression via
microRNA binding to 3' UTRs may be common
since many mRNAs have relatively long 3' UTRs.
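
The clustering of nearby polyA sites into PACs can be sketched as follows.
The text does not specify how the 10 nt distance was measured, so the sketch
assumes single-linkage clustering, in which a site joins the current cluster
when it lies within 10 nt of the previous site. It also assumes one strand
of one contig, and applies the tag threshold from Table 4 (more than four
tags). The function name and example positions are hypothetical.

```python
from typing import Dict, List, Tuple


def cluster_polya_sites(site_tags: Dict[int, int], max_gap: int = 10,
                        min_tags: int = 5) -> List[Tuple[int, int, int]]:
    """Cluster sites on one strand; returns (first site, last site, total tags)."""
    clusters: List[Tuple[int, int, int]] = []
    for position in sorted(site_tags):
        if clusters and position - clusters[-1][1] <= max_gap:
            first, _, tags = clusters[-1]
            # Extend the current cluster to include this site.
            clusters[-1] = (first, position, tags + site_tags[position])
        else:
            clusters.append((position, position, site_tags[position]))
    return [cluster for cluster in clusters if cluster[2] >= min_tags]


# Hypothetical polyA sites (position -> number of supporting polyA tags).
sites = {1000: 3, 1004: 2, 1012: 1, 1050: 2, 1200: 7}
pacs = cluster_polya_sites(sites)
```

With single linkage a cluster can span more than 10 nt in total when
intermediate sites bridge the gap, as in the first cluster of the example.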

The mapped polyA sites were analyzed for putative cis-acting signals, i.e.
polyadenylation signals (PAS). Positions -40 to -1 of each polyA site were
searched for overrepresented hexamers using an iterative algorithm described
by Beaudoing et al. [197]. The search identified 13 prominent hexamers, which
represent putative PAS. None of these were identical to the canonical
eukaryotic PAS (AAUAAA). However, 5 of the 13 putative G. intestinalis PAS
contained the tetranucleotide 'UAAA', which is a part of the canonical
eukaryotic motif. The tetramer was located approximately 10 nt from the polyA
site. Interestingly, polyA sites located in the sense orientation tended to have
fewer of the identified hexamers than antisense or intergenic polyA
sites.
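
The idea of an iterative hexamer search can be illustrated with a simplified
sketch. It is not the published algorithm: it ranks hexamers only by the
number of upstream windows that contain them and removes the windows
explained by each pick, without testing against a background model, which a
real analysis would require. Sequences use the DNA alphabet (T instead of
U), and all names and toy sequences are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def upstream_window(sequence: str, polya_site: int, length: int = 40) -> str:
    """Bases at positions -length..-1 relative to a 0-based plus-strand polyA site."""
    return sequence[max(polya_site - length, 0):polya_site]


def iterative_top_kmers(windows: List[str], k: int = 6, rounds: int = 5,
                        min_support: int = 2) -> List[Tuple[str, int]]:
    """Repeatedly pick the k-mer present in most windows, then drop those windows."""
    remaining = list(windows)
    picks: List[Tuple[str, int]] = []
    for _ in range(rounds):
        support: Counter = Counter()
        for window in remaining:
            # Count each k-mer at most once per window.
            support.update({window[i:i + k] for i in range(len(window) - k + 1)})
        if not support:
            break
        # Ties are broken alphabetically to keep the result deterministic.
        best = min(support, key=lambda kmer: (-support[kmer], kmer))
        if support[best] < min_support:
            break
        picks.append((best, support[best]))
        remaining = [w for w in remaining if best not in w]
    return picks


toy_windows = ["GGCAGTAAACGG", "CCAGTAAACTTC", "TTAGTAAACGCA", "GGGCCCGGGCCC"]
top = iterative_top_kmers(toy_windows)
```

Removing the windows that contain each selected hexamer prevents shifted
variants of the same motif from dominating the subsequent rounds.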

The nucleotide composition surrounding polyA sites was analyzed, which
indicated a distinct pattern of AU-richness. This pattern may be required for
recognition by the polyadenylation machinery, or for binding of
polyadenylation factors. 81% of tail-to-tail gene pairs had 3' UTRs that
overlapped the transcription unit of an adjacent gene on the opposite strand.
This transcriptional organization may be a means of gene regulation, but also
causes the production of double stranded RNA. Such pervasive transcription is
often not compatible with functional RNA interference. While G. intestinalis
has some components of RNAi, its machinery is not identical to that found in
higher eukaryotes. It can therefore be speculated whether G. intestinalis has
lost certain components of RNAi in favour of genomic fidelity and minimalism.

4.4.4 Biallelic Transcription and Correlation of Allele Expression 
and Allele Dosage 

As described in paper i, the GS isolate contains ~0.5% genome-wide heterozy- 
gosity. The RNA-Seq data were used to confirm expression of the identified 
alleles and to study allele-specific expression (ASE). Mapping bias is a major
problem in ASE assays: current mapping algorithms preferentially map one of
the alleles (discussed in, for example, [14]). To further
understand the extent of mapping bias, we generated and mapped simulated
RNA-Seq data. The simulated data followed realistic error profiles. They did
not indicate systematic mapping bias, but nevertheless indicated an inherent
bias toward mapping of one allele. The amount of bias was influenced by
sequence errors, but likely also by other factors.

The simulated data were modeled using a Cauchy distribution and used
for significance testing. 98% of the genes with at least one heterozygous locus
displayed biallelic transcription, i.e. the two alleles were identified at the tran- 
scriptional level. Of these genes, 82% indicated allelic expression imbalance 
at p=0.05, i.e. not an equal number of reads were derived from each allele. 
We examined if the allelic expression ratio corresponded to the observed al- 
lele count. Allele counts were inferred from the depth of genomic reads. For 
the vast majority of heterozygous loci there was a linear correlation between 
the RNA-Seq signal and the inferred allele count. When only heterozygous 
sites corresponding to the allele ratio A:A:B:B were investigated, 59% of the 
analyzed sites displayed expression imbalance. It can be assumed that most 
of the allelic imbalance is caused by allele dosage rather than cis or trans reg- 
ulatory differences. In conclusion, the current data indicate that both nuclei 
are transcriptionally active, and there was a correlation between expression 
level and allele dosage. Together these data indicate that transcription in G. 
intestinalis is symmetric. 

4.4.5 Only Six Genes are cis-spliced 

Only five genes have previously been reported to contain introns and undergo
cis-splicing in G. intestinalis. These genes are listed in Table [5]. We performed
an exhaustive search for new cis-splicing events using a comparative
transcriptomics approach. Putative splice junctions of WB, AS175, P15 and GS
were first mapped with the software TopHat [198]. Manual inspection of the
deduced splice junctions indicated a large number of dubious suggestions, as
concluded from the proposed splice pattern and the annotation of the involved
genes. For example, many multicopy genes were suggested to undergo extensive
splicing, e.g. vsp and p21.1 genes. The repetitive nature of these genes is
likely to give artifacts when mapping short reads, especially since the
implemented algorithms chop reads into even smaller pieces before mapping them.
To increase the signal-to-noise ratio, the algorithmically identified splice sites
were filtered according to these criteria: (i) the splicing pattern was required
to be conserved in at least two isolates; (ii) repeated genes were discarded
(e.g. vsp, hcmp and p21.1); (iii) the intron had to be confined to the ORF
or the closest upstream intergenic region; and (iv) the splice pattern had to
be supported by a minimum of 5 reads.
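
The four filtering criteria can be sketched as follows. This is a minimal
illustration and not the code used in this work: the record layout, field
names and example gene identifiers are hypothetical, and the read threshold is
assumed to apply to the total number of junction-spanning reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    """One candidate splice junction (all field names are hypothetical)."""
    gene_id: str
    isolates: frozenset           # isolates in which the junction was detected
    supporting_reads: int         # reads spanning the junction (assumed total)
    in_multicopy_gene: bool       # gene belongs to a repeated family
    within_orf_or_upstream: bool  # intron lies in the ORF or the closest
                                  # upstream intergenic region


def passes_filters(j, min_isolates=2, min_reads=5):
    """Apply criteria (i)-(iv) from the text to one candidate junction."""
    return (
        len(j.isolates) >= min_isolates      # (i) conserved in >= 2 isolates
        and not j.in_multicopy_gene          # (ii) repeated genes discarded
        and j.within_orf_or_upstream         # (iii) positional constraint
        and j.supporting_reads >= min_reads  # (iv) minimum read support
    )


def filter_junctions(junctions):
    """Return the candidates that satisfy all four criteria."""
    return [j for j in junctions if passes_filters(j)]


# Made-up candidates, one failing each criterion and one passing all four.
candidates = [
    Junction("geneA", frozenset({"WB", "GS"}), 12, False, True),
    Junction("geneB", frozenset({"WB"}), 40, False, True),
    Junction("geneC", frozenset({"WB", "P15", "GS"}), 30, True, True),
    Junction("geneD", frozenset({"AS175", "P15"}), 9, False, False),
    Junction("geneE", frozenset({"WB", "AS175"}), 3, False, True),
]
print([j.gene_id for j in filter_junctions(candidates)])
```

Only the first made-up candidate survives; each of the others is rejected by
exactly one criterion, which makes the effect of every filter easy to inspect.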



Table 5: Confirmed introns in G. intestinalis

Gene                  ID a    Ref.       Boundary b   nt c   #Isolates d   Pos. e
[2Fe-2S] ferredoxin   27266   [63]       CT-AG          35   4             0.05
Rpl7a                 17244   [64, 62]   GT-AG         109   4             0.54
dynein light-chain    15124   [62]       GT-AG          32   3             0.05
uncharacterized       35332   [64]       GT-AG         220   2             0.01
uncharacterized       15604   [65]       GT-AG          29   4             0.02
uncharacterized       86945   New        GT-AG          36   4             0.02

a Prefix: GL50803_

b The intron boundaries, 5' to 3' (splice sites).

c Length of the intron in nucleotides.

d Number of isolates the splicing pattern was found in.

e Intron position in the gene along the 5' to 3' axis. The position was calculated as: [position
of first nucleotide of the intron] / [gene length].



The five previously reported introns were found in our data, indicating that 
the bioinformatics procedure was robust. Three of the previously confirmed 
splice variants were found in all 4 isolates, and two were found only in 3 
and 2 isolates, respectively (Table [5]). This reflects limitations in our data 
sets or genome assemblies rather than differential splicing between isolates. 
The bioinformatic search suggested 14 new intron candidates. However, PCR 
validation on genomic DNA and RNA (cDNA) subsequently rejected 13 of 
the 14 candidates, which suggests that splice site prediction using RNA-Seq 
is noisy. One new intron was confirmed by PCR, and the amplified DNA was
sequenced with dye-terminator sequencing, confirming splice sites.

The novel intron was 36 nt in length and was present in an uncharacterized
gene on chromosome 4 (Table [5], Figure 12). The intron was identified in all
four isolates and had canonical splice boundaries (GT-AG). Removal of the
intron extends the open reading frame by 73 codons. It is possible that some
putative introns were missed, especially low-level cis-splicing events. However,
the true number of introns in G. intestinalis is not likely to be much higher
than presented here.

atgctggattctgtgatctctctttttcttgcggcccttcgcgaagaaggtgtaccagaa 
MLDSVISLFLAALREEGVPE 

10 20 30 40 50 60 

gcccaaacactcgagctgctccagaccatccgttcctggccacagatcaaggcaagcaca 
AQTLELLQTIRSWPQIKAST 

70 80 90 100 110 120 

tatactatc gtatgatttattttttcccaacagcctaacacacag atacagacacttaat 
YTI IQTLN 

130 140 150 160 170 180 

aaccttgctaccagaggagtacgagcgtccaaagcgcttacagatattaccaccacattt 
NLATRGVRASKALTDITTTF 

190 200 210 220 230 240 

ttcacttctccgcgaatgctacagcgcaggggtctttcttgtcaagacctagatgctttt 
FTSPRMLQRRGLSCQDLDAF 

250 260 270 280 290 300 

catgactttagtggcgtgattgtaagaaattttattgtccatgggcatcagatccatggg 
HDFSGVIVRNFIVHGHQIHG 

310 320 330 340 350 360 

gttggctttactcctcttcagcttcttaga 
VGFTPLQLLR 

370 380 390 

Figure 12: 36-nt intron (underlined) in the gene GL50803_86945, encoding an
uncharacterized protein. Only the first 390 nt of the gene is shown. The translational
start codon is indicated in green and the conceptual translation is shown in blue.
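
The effect of removing the intron can be checked directly on the sequence shown
in Figure 12. The sketch below uses only the first 180 nt of the figure and
assumes the standard genetic code; the variable and function names are chosen
here for illustration.

```python
# Standard genetic code in the conventional TCAG ordering (an assumption of
# this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 1, ignoring any incomplete trailing codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Sequence copied from Figure 12 (positions 1-180).
EXON1 = (
    "atgctggattctgtgatctctctttttcttgcggcccttcgcgaagaaggtgtaccagaa"
    "gcccaaacactcgagctgctccagaccatccgttcctggccacagatcaaggcaagcaca"
    "tatactatc"
)
INTRON = "gtatgatttattttttcccaacagcctaacacacag"
EXON2_START = "atacagacacttaat"

unspliced = EXON1 + INTRON + EXON2_START
spliced = EXON1 + EXON2_START

print(len(INTRON), INTRON[:2].upper(), INTRON[-2:].upper())
print(len(EXON1) + 1)        # 1-based position of the first intron nucleotide
print(translate(unspliced))  # reading through the intron meets a stop codon
print(translate(spliced))    # splicing restores the conceptual translation
```

Read through without splicing, the second codon of the intron (tga) terminates
the reading frame, which is why removal of the intron extends the open reading
frame.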

Five out of six introns displayed a positional bias towards the 5' end of the
gene (Table [5]). Intron deficit and 5' bias have been observed in the two
protozoa Encephalitozoon cuniculi [199] and Guillardia theta [200], both of which
belong to intron-rich groups of organisms. Gene sampling of species of the
order Oxymonadida (anaerobic flagellates found mainly in the gut of termites
and wood-eating roaches) has found evidence of protozoa with extensive
intron content [201]. These data corroborate the idea that G. intestinalis derived
from a more intron-rich ancestor but lost introns during evolution.

5 Concluding remarks

There is grandeur in this view of life, with its several powers, 
having been originally breathed into a few forms or into one; 
and that, whilst this planet has gone cycling on according to 
the fixed law of gravity, from so simple a beginning endless 
forms most beautiful and most wonderful have been, and are 
being, evolved. 

- C. Darwin (1809-1882), On the Origin of Species. 



5.0.6 Comparative Genomics 

The sequence data generated in the present studies are publicly available in
integrative databases together with data sets from independent studies. Much
of the data have already been utilized in several independent studies (for
example [67, 202, 203, 204, 205]), which underscores the value of large genomic
data sets in parasitology: both to increase understanding of parasite biology
and for hypothesis generation. As sequencing technologies continue to improve
in terms of output and costs, it will be of value for the research community
to undertake large-scale efforts to sequence multiple genomes of biologically
relevant strains. Similar to the 1000 Genomes Project [206], future efforts in
parasitology may target several hundred or more strains. The integration of
such data sets into community-oriented databases, for example EuPathDB
[8], should allow researchers to take advantage of the vast amount of data.
One challenge for the future will be to develop suitable phenotyping
strategies for protozoan parasites, since sequence data from many strains may be
of limited value if phenotypes are unknown. In addition, we can learn about the
evolutionary adaptations that led to parasitism from genome comparisons of
even more diverged species, for example free-living or avirulent species within
the diplomonadida and kinetoplastida. Examples of such species are
Spironucleus sp. (diplomonadida) and Trypanosoma rangeli (kinetoplastida). Such
sequencing projects are currently being undertaken and are likely to yield new
insights into parasite evolution.

Many genes of these parasites are currently uncharacterized. A future 
priority of parasitology should therefore be to explore the functionality of 
uncharacterized genes, since these are the genes likely to be parasite-specific 
rather than universally conserved among eukaryotes. In theory, it would also
be possible to find drug targets among these genes. In conclusion, many
questions relating to the basic biology and evolutionary history of these par- 
asites are still unresolved. For example, do isolate-specific genes contribute 
to the phenotype? What is the mechanism behind horizontal gene transfer 
and how often does it happen? Why do some parasite lineages contain higher 
heterozygosity than others? Do protozoan parasites recombine and how of- 
ten? Genome sequencing may not by itself provide answers to these questions; 
rather genomic data should be used together with functional studies or other 
data sets. Ultimately, this may facilitate new insights into the basic biology 
that may lead to new treatments and control of these pathogens. 



5.0.7 The Short Transcriptome of T. cruzi 

There was no evidence of canonical small RNAs as often found in metazoans,
e.g. microRNAs, small interfering RNAs or piwi-interacting RNAs. This finding
is consistent with the lack of RNA interference in T. cruzi. More than 90%
of small RNAs were derived from known non-coding RNAs (tRNA, rRNA and
snRNA). The most prominent category was tRNA-derived small RNAs (tsRNAs). It
remains to be determined if these small RNA classes are functional, partially
functional or merely debris from the RNA "degradome," i.e. turnover. The
following observations warrant further investigation of T. cruzi small RNAs:
(i) the cleavage pattern appears to be non-random; (ii) certain small RNAs
appear to be stable, as suggested by the copy number; (iii) tsRNAs localize
to cytoplasmic granules [174]; and (iv) certain tRNA isoacceptors were
overrepresented in terms of deriving tsRNAs. Interestingly, we identified a
population of small RNAs that were not derived from known non-coding RNAs,
and were predicted to contain microRNA-like seed regions. The present study
only scratches the surface of the small RNA transcriptome and raises
questions that call for further investigation.



5.0.8 Comparative Transcriptomics of G. intestinalis 

Transcription is one key event in the translation of genotype to phenotype, 
and transcriptome studies can enhance our understanding of phenotypic dif- 
ferences and the evolutionary trajectories of pathogens. Almost the entire 
genome of G. intestinalis was transcribed in trophozoites grown in vitro, con- 
firming the promiscuous nature of transcription in this parasite. The data 
confirmed many gene models that were originally annotated without
transcriptional evidence, and also identified novel protein-coding genes. The
extent of gene expression divergence recapitulated the known phylogeny of the
strains, suggesting that gene expression has largely evolved via genetic drift.
Despite many genes with strain-specific expression, it is difficult or impossible
to conclude which differences are of functional importance. Furthermore, it is
important to note that differences in mRNA abundance may not necessarily
result in differences at the protein level.

Notably, a non-expressed gene locus was identified, consisting of 28 non- 
transcribed genes, which may reflect transcriptional repression by a yet-to-be 
described mechanism. Sequence signatures indicated a putative viral origin. 
Perhaps silencing operates at the level of chromatin organization. Functional 
investigation will be required to elucidate the mechanism behind this mode of 
silencing. 

Biallelic transcription was identified at most of the heterozygous loci of 
the G. intestinalis isolate GS, which were identified and described in paper 
i. Comparison of allele dosage with allelic transcription levels indicated a 
relationship between the two variables. The data corroborate previous ob- 
servations of binucleic transcription in G. intestinalis and further provide an 
association between allele dosage and transcription levels. These data suggest 
that binucleic transcription is largely symmetric, which nonetheless does not 
preclude the existence of allele-specific expression. However, the latter does 
not appear to be a general feature of transcription in G. intestinalis, which is 
also consistent with the deficit of regulation at the transcriptional level. 

Previously reported introns were confirmed and only one new intron was 
discovered, suggesting that the true number of introns is not likely to be much 
higher than this. It is tempting to speculate that G. intestinalis may have been 
more intron-rich in the past but has since undergone intron loss. Conversely, it is also
possible that introns became more prevalent in eukaryotes after the split of G. 
intestinalis from the main eukaryotic lineage. However, it seems unlikely that 
six introns have necessitated the evolution of the relatively complex splicing 
machinery of G. intestinalis. The final evidence to settle this question would 
be the finding of a diplomonad relative with an extensive repertoire of introns, 
which is yet to be discovered. 

6 Popular science summary



The two parasites Trypanosoma cruzi and Giardia intestinalis are single-celled
organisms that cause disease and suffering in several million people around
the world. Trypanosoma cruzi is mainly a problem in Latin America, where it
gives rise to Chagas disease. About 8 million people are infected in Latin
America and 11,000 die each year as a result of the disease 197]. In
Sweden there are, by a rough estimate, 1,000 infected people [207]. Giardia
affects people all over the world and can lead to the diarrheal disease
giardiasis. Migration is making the diseases more common in Europe, North
America and other parts of the world. Both Chagas disease and Giardia are
counted as neglected diseases and cause problems in developing countries
[26, 97]. These diseases, together with several other tropical diseases, are
often not prioritized for drug development because this is not considered
economically profitable.

In this thesis I have used computer analyses to study biological information
from the two parasites, including the genes encoded in their genomes.
Information from the parasites is read with special instruments that read the
genetic code with high accuracy. The information is then analyzed to find
biologically relevant patterns that can increase the understanding of the
parasites' biology. Detailed analyses and comparisons between different strains
are described, identifying traits that are encoded in the parasites' DNA. The
methods used also reveal evolutionary patterns, which can help to increase the
understanding of how the parasites have adapted to humans and how they evade
the immune system. In this thesis the activity of the genes, that is, gene
expression, has also been studied. Gene expression can reveal how different
strains differ functionally. Information from the studies carried out can in
the future be used to design more tailored experiments that can lead to better
treatment methods and diagnostics.